infra/docs/plans/2026-06-08-matrix-synapse-to-tuwunel-plan.md


Matrix: Synapse → tuwunel migration — Plan (executed)

Date: 2026-06-08 · Companion: 2026-06-08-matrix-synapse-to-tuwunel-design.md

Executed steps

  1. Vault — generated a 32-byte registration_token, stored at secret/matrix.
  2. stacks/matrix rewrite — replaced Synapse with tuwunel: removed the matrix-db-creds ExternalSecret, both init-containers (install-psycopg2, inject-db-password), the extra-packages volume, and the Reloader annotation; added the matrix-secrets ExternalSecret (vault-kv dataFrom), the TUWUNEL_* env, securityContext 1000, and the tuwunel image. Encrypted PVC, Service (80→8008), and ingress (auth="none", proxied) unchanged.
    • The image is in the deployment's ignore_changes (KEEL_IGNORE_IMAGE); it was temporarily un-ignored for this base-image swap, then re-added at step 4 so Keel resumes tag management.
    • tg init -reconfigure was required first (Tier-1 PG-backend creds rotate weekly → "Backend configuration block has changed").
  3. Apply. Plan: 1 to add, 2 to change, 1 to destroy. tuwunel 1.7.1 came up 1/1 and created a fresh RocksDB on the encrypted PVC (no permission errors; fsGroup worked).
  4. Verify: all 200 for /_tuwunel/server_version, .well-known/matrix/{client,server}, /_matrix/client/versions, and /_matrix/federation/v1/version. Registered @viktor:matrix.viktorbarzin.me (first user → admin) via the token flow; whoami confirmed. Creds stored at secret/matrix (admin_user, admin_password).
  5. Lock down: set TUWUNEL_ALLOW_REGISTRATION=false + re-added image ignore_changes; applied. Registration now returns 403 M_FORBIDDEN.
  6. Cleanup
    • stacks/vault: removed the pg_matrix static role + its allowed_roles entry (targeted apply — the full plan also wanted an unrelated OIDC tune-TTL change, deliberately NOT applied; see residual items).
    • Dropped the orphaned matrix Postgres DB (16 MB) + matrix role on the CNPG primary (pg-cluster-2).
    • Docs updated: .claude/CLAUDE.md (PG-rotation list), service-catalog.md, upgrade-config.json (removed synapse image-rename + matrix PG entry), authentication.md + authentik-state.md (Matrix OIDC → orphaned).
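The endpoint checks from step 4 lend themselves to a small script. The following is a minimal sketch, not the procedure that was actually run. It assumes the public base URL is https://matrix.viktorbarzin.me (inferred from the user ID, not stated in this plan) and that the .well-known paths sit at the root of that host. The names `fetch_status`, `check_endpoints`, and `failures` are hypothetical.

```python
from typing import Callable, Dict, List
from urllib.error import HTTPError
from urllib.request import urlopen

# Assumption: base URL inferred from the @viktor:matrix.viktorbarzin.me user ID.
BASE_URL = "https://matrix.viktorbarzin.me"

# Paths listed in step 4; the leading "/" on .well-known is assumed.
PATHS: List[str] = [
    "/_tuwunel/server_version",
    "/.well-known/matrix/client",
    "/.well-known/matrix/server",
    "/_matrix/client/versions",
    "/_matrix/federation/v1/version",
]


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def check_endpoints(
    base_url: str = BASE_URL,
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Map every path in PATHS to the status code it returned."""
    return {path: fetch(base_url + path) for path in PATHS}


def failures(results: Dict[str, int]) -> Dict[str, int]:
    """Keep only the paths that did not return 200."""
    return {path: status for path, status in results.items() if status != 200}
```

Calling `failures(check_endpoints())` returns an empty dict when all five endpoints answer 200. The `fetch` parameter is injectable so the check can be exercised without touching the live server.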

Rollback

Fresh start was confirmed, so there is no Synapse data to preserve. To revert the service: restore the Synapse main.tf from git, re-add the pg_matrix Vault role, and restore the matrix Postgres DB from the daily per-db dump (/backup/per-db/matrix/). The reused encrypted PVC still holds Synapse's old homeserver.yaml / signing key / media at the volume root alongside the new RocksDB dir.

Residual / follow-up items (flagged to user)

  • Authentik Matrix OAuth2 app — REMOVED 2026-06-08 (user-confirmed). It was UI-managed (NOT in the authentik TF stack), so it was deleted via the Authentik API: application matrix + OAuth2 provider pk=6. tuwunel uses native password auth, so nothing consumed it.
  • Pre-existing drift in stacks/vault: vault_jwt_auth_backend.oidc shows a tune diff (explicit 768h default/max lease TTLs being dropped). This predates this migration and was not applied. Resolve separately.
  • Synapse leftover files remain on the encrypted PVC volume root (unused by tuwunel). Can be rm'd after confidence in the new server.

Follow-up: open registration + bot mitigations (2026-06-08, user-chosen)

Registration was opened fully (tokenless): TUWUNEL_ALLOW_REGISTRATION=true + TUWUNEL_YES_I_AM_VERY_VERY_SURE_I_WANT_AN_OPEN_REGISTRATION_SERVER_PRONE_TO_ABUSE=true, dropped the TUWUNEL_REGISTRATION_TOKEN env (the Vault secret/matrix token + matrix-secrets ESO are kept for one-env-change revert to token-gated). tuwunel has no CAPTCHA (only Synapse does) and a browser challenge would break native clients, so bot defense is layered instead:
  • Traefik rate-limit on /register — a register-ratelimit Middleware (stacks/matrix) on a path-scoped ingress_register carve-out (longer prefix wins over the catch-all). Keyed on the request Host (global /register cap), not source IP — because the host is reachable both via Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE tunnel → pfSense HAProxy → Traefik, no CF header); a per-source key let IPv6 bots bypass entirely (found during testing). 10/min, burst 20, per Traefik replica (×3).
  • CrowdSec (already on the ingress chain) is the hard backstop — bans abusive IPs on both paths; covers the per-replica looseness of the soft rate-limit.
  • Notification: Loki ruler rule MatrixNewUserRegistered (stacks/monitoring, matches ... registered on this server, never the rejection line) → lane=security → existing #security Slack receiver. Also note tuwunel's admin bot (@conduit:matrix.viktorbarzin.me) natively posts every registration to the server admin room, so there's an in-Matrix notice too.
  • Verification: open signup returns 200 (@regtest1, since deactivated via !admin users deactivate in the admin room); Traefik access logs confirm /register routes through the rate-limited carve-out router. A live 429 was not force-tested: with a burst of 20 per replica, the aggregate burst is ~60 across 3 replicas, and hammering that hard risked tripping CrowdSec on the test source IP.
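The choice of rate-limit key and the per-replica arithmetic above can be illustrated with a toy model. Traefik documents its RateLimit middleware as a token bucket. The sketch below is a simplified stand-in for that behaviour, not Traefik's implementation, and every class and function name in it is hypothetical. It uses the 10/min, burst 20 figures from the bullets above and assumes requests spread evenly across replicas for the aggregate figure.

```python
from typing import Callable, Dict, Iterable, Tuple


class TokenBucket:
    """Simplified token bucket: refills at per_minute, holds at most burst."""

    def __init__(self, per_minute: float, burst: int) -> None:
        self.per_minute = per_minute
        self.burst = burst
        self.tokens = float(burst)  # start full
        self.last = 0.0             # timestamp of the previous request (seconds)

    def allow(self, now: float) -> bool:
        elapsed = now - self.last
        refill = elapsed * self.per_minute / 60.0
        self.tokens = min(float(self.burst), self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the real middleware would answer 429 here


class KeyedLimiter:
    """One bucket per key, e.g. per request Host or per source IP."""

    def __init__(self, per_minute: float, burst: int) -> None:
        self.per_minute = per_minute
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str, now: float) -> bool:
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.per_minute, self.burst)
        return self.buckets[key].allow(now)


def key_by_host(source_ip: str, host: str) -> str:
    return host       # global cap on /register, as deployed


def key_by_source(source_ip: str, host: str) -> str:
    return source_ip  # the variant that IPv6 bots bypassed


def allowed_in_burst(
    key_for: Callable[[str, str], str],
    sources: Iterable[str],
    host: str = "matrix.example",  # placeholder host name
    per_minute: float = 10,
    burst: int = 20,
) -> int:
    """Count requests admitted when every source hits /register at t=0."""
    limiter = KeyedLimiter(per_minute, burst)
    return sum(limiter.allow(key_for(src, host), 0.0) for src in sources)


def aggregate_cap(replicas: int, per_minute: int = 10, burst: int = 20) -> Tuple[int, int]:
    """Upper bound (rate per minute, burst) when each replica limits independently."""
    return per_minute * replicas, burst * replicas


bots = [f"2001:db8::{i:x}" for i in range(100)]  # 100 distinct source addresses
print(allowed_in_burst(key_by_host, bots))    # 20
print(allowed_in_burst(key_by_source, bots))  # 100
print(aggregate_cap(3))                       # (30, 60)
```

With a per-source key, each new address gets its own full bucket, so a bot that rotates addresses is never throttled. With the Host key, all registrations share one bucket per replica, which is why the effective ceiling is roughly 30/min with a burst of about 60 across the three replicas.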

Add a user: anyone can self-register now. To provision manually instead: !admin users create-user <name> in the admin room (first user @viktor is admin). Revert to token-gated: drop the YES_I_AM... flag, re-add TUWUNEL_REGISTRATION_TOKEN.