# VPN & Remote Access Architecture
Last updated: 2026-04-10
## Overview
Remote access to the homelab is provided through a hybrid VPN architecture: WireGuard site-to-site tunnels connect physical locations (Sofia, London, Valchedrym), while Headscale (self-hosted Tailscale control server) provides mesh overlay networking for roaming clients. Split DNS architecture ensures resilience: AdGuard serves as the global DNS resolver for all VPN clients, while Technitium handles internal `.lan` domains. This design prevents tunnel dependency for public DNS resolution — if the Cloudflared tunnel goes down, clients can still access the internet.
## Architecture Diagram
### VPN Topology
```mermaid
graph TB
subgraph "Site-to-Site WireGuard (Hub-and-Spoke)"
Sofia[Sofia pfSense
10.3.2.1
tun_wg0]
London[London GL-iNet Flint 2
10.3.2.6
192.168.8.0/24]
Valchedrym[Valchedrym OpenWRT
10.3.2.5
192.168.0.0/24]
Sofia ---|WireGuard Tunnel| London
Sofia ---|WireGuard Tunnel| Valchedrym
end
subgraph "Headscale Mesh Overlay"
HS[Headscale
headscale.viktorbarzin.me
K8s Service]
Authentik[Authentik OIDC
SSO Login]
DERP[DERP Relay
Region 999
Embedded in Headscale]
subgraph "Clients"
Laptop[MacBook
Tailscale Client]
Phone[iPhone
Tailscale Client]
Remote[Remote VM
Tailscale Client]
end
HS --> Authentik
HS --> DERP
Laptop -.mesh.- Phone
Laptop -.mesh.- Remote
Phone -.mesh.- Remote
Laptop --> HS
Phone --> HS
Remote --> HS
Laptop -.relay fallback.- DERP
Phone -.relay fallback.- DERP
end
Sofia --> HS
```
### DNS Resolution Flow
```mermaid
sequenceDiagram
participant Client as VPN Client
participant AdGuard as AdGuard DNS
(Global)
participant Technitium as Technitium DNS
(Internal .lan)
participant Cloudflare as Cloudflare DNS
(Public Domains)
Note over Client: Query: immich.viktorbarzin.me
Client->>AdGuard: DNS query
AdGuard->>Cloudflare: Forward (not .lan)
Cloudflare-->>AdGuard: A record (Cloudflare IP)
AdGuard-->>Client: Response
Note over Client: Query: truenas.viktorbarzin.lan
Client->>AdGuard: DNS query
AdGuard->>Technitium: Forward (.lan domain)
Technitium-->>AdGuard: A record (10.0.10.15)
AdGuard-->>Client: Response
Note over Client,Technitium: If Cloudflared tunnel is down:
Client->>AdGuard: DNS query (google.com)
AdGuard->>Cloudflare: Forward (public DNS works)
Cloudflare-->>AdGuard: A record
AdGuard-->>Client: Response (no tunnel dependency)
```
## Components
| Component | Version/Type | Location | Purpose |
|-----------|-------------|----------|---------|
| WireGuard | Built-in (pfSense/OpenWRT) | Sofia (pfSense), London (GL-iNet Flint 2), Valchedrym (OpenWRT) | Site-to-site encrypted tunnels (hub-and-spoke) |
| Headscale | v0.23.x (container) | K8s (headscale.viktorbarzin.me) | Tailscale control server, mesh coordinator |
| Tailscale | Client v1.x | User devices | Mesh VPN client |
| Authentik | OIDC provider | K8s | SSO authentication for Headscale |
| DERP Relay | Embedded in Headscale | K8s (region 999) | Relay for NAT traversal |
| AdGuard DNS | Container | K8s | Global DNS resolver with ad-blocking |
| Technitium DNS | Container | K8s (10.0.20.101) | Internal .lan domain resolver |
## How It Works
### WireGuard Site-to-Site
Three physical locations are permanently connected via WireGuard in a **hub-and-spoke** topology with Sofia as the hub. A single WireGuard interface (`tun_wg0`) on pfSense carries both peers on the `10.3.2.0/24` tunnel subnet:
- **Sofia** (hub): `10.3.2.1` — pfSense, K8s cluster on `10.0.20.0/24`, management on `10.0.10.0/24`, LAN on `192.168.1.0/24`
- **London** (spoke): `10.3.2.6` — GL-iNet Flint 2 (GL-MT6000), LAN `192.168.8.0/24`, guest `192.168.9.0/24`
- **Valchedrym** (spoke): `10.3.2.5` — OpenWRT router, LAN `192.168.0.0/24`
Routes are configured as static routes on pfSense. London and Valchedrym route Sofia-bound traffic through their WireGuard tunnels. London ↔ Valchedrym traffic transits through Sofia (no direct tunnel).
**Use cases**:
- Replication of Vault data between Sofia and London
- Offsite database replicas
- Accessing Proxmox hosts across locations
### Headscale Mesh Overlay
Headscale is a self-hosted alternative to Tailscale's commercial control plane. It provides:
- **Mesh networking**: Clients establish direct WireGuard connections to each other (peer-to-peer).
- **NAT traversal**: DERP relays provide connectivity when direct connections fail.
- **OIDC authentication**: Users log in via Authentik, no pre-shared keys.
- **ACL policies**: Fine-grained control over which clients can reach which destinations.
**Client onboarding**:
1. User installs Tailscale client (official macOS/iOS/Android app)
2. Runs: `tailscale login --login-server https://headscale.viktorbarzin.me`
3. Browser opens to Authentik SSO login
4. After successful login, Tailscale presents a registration URL
5. Admin approves the device via `headscale nodes register --user --key `
6. Client is added to the mesh, receives IP in 100.64.0.0/10 range
**Connectivity test**: `ping 10.0.20.100` (Sofia K8s API server) verifies full access to the homelab network.
### DERP Relay for NAT Traversal
**Problem**: Symmetric NAT or restrictive firewalls prevent direct WireGuard connections between clients.
**Solution**: Headscale runs an embedded DERP relay server (region 999, named "Home DERP"). DERP is Tailscale's NAT traversal protocol, implemented as an HTTPS-based relay.
**How it works**:
1. Clients attempt direct WireGuard connection via STUN/ICE.
2. If direct connection fails, both clients connect to the DERP relay via HTTPS.
3. Traffic is encrypted end-to-end with WireGuard, DERP only relays packets.
4. No additional ports needed — DERP uses the same HTTPS ingress as Headscale (443).
**Performance**: DERP adds latency (extra hop through Sofia K8s cluster), but ensures connectivity in all scenarios.
### Split DNS Architecture
**Design goal**: Prevent tunnel dependency for public DNS resolution. If the Headscale tunnel or Cloudflared tunnel fails, clients must still resolve public domains.
**Implementation**:
- **AdGuard DNS**: Global recursive resolver, serves all VPN clients. Includes ad-blocking and malicious domain filtering.
- **Technitium DNS**: Internal authoritative server for `.viktorbarzin.lan` domains.
**Resolution flow**:
1. Client queries AdGuard for any domain.
2. If domain ends in `.lan`, AdGuard forwards to Technitium (10.0.20.201).
3. For all other domains, AdGuard resolves directly via upstream (Cloudflare 1.1.1.1).
4. AdGuard caches responses, reducing load on Technitium and upstream.
**Resilience**: Even if the tunnel to Sofia is down, clients can still resolve `google.com`, `github.com`, etc., because AdGuard talks directly to Cloudflare. Only `.lan` domains become unavailable.
### Access Control (Authentik Groups)
**Headscale Users** group in Authentik controls VPN access. Membership is invitation-only:
1. Admin creates user in Authentik.
2. Admin adds user to "Headscale Users" group.
3. User logs in via OIDC during `tailscale login`.
4. Headscale verifies group membership via OIDC claims.
Removing a user from the group revokes VPN access on next re-authentication (every 30 days).
## Configuration
### Terraform Stacks
| Stack | Path | Resources |
|-------|------|-----------|
| Headscale | `stacks/headscale/` | Deployment, Service, Ingress, ConfigMap |
| AdGuard | `stacks/adguard/` | Deployment, Service, PVC |
| Technitium | `stacks/technitium/` | Deployment, Service, PVC |
| pfSense (Sofia) | Not in Terraform | WireGuard tunnel configs (managed via pfSense UI) |
### Headscale Configuration
**ConfigMap**: `stacks/headscale/main.tf`
```yaml
server_url: https://headscale.viktorbarzin.me
listen_addr: 0.0.0.0:8080
metrics_listen_addr: 0.0.0.0:9090
oidc:
issuer: https://authentik.viktorbarzin.me/application/o/headscale/
client_id:
client_secret:
scope: ["openid", "profile", "email", "groups"]
allowed_groups: ["Headscale Users"]
derp:
server:
enabled: true
region_id: 999
region_code: "home"
region_name: "Home DERP"
stun_listen_addr: "0.0.0.0:3478"
urls:
- https://controlplane.tailscale.com/derpmap/default
auto_update_enabled: true
update_frequency: 24h
ip_prefixes:
- 100.64.0.0/10
dns_config:
nameservers:
- 10.0.20.102 # AdGuard DNS
domains:
- viktorbarzin.lan
magic_dns: true
```
**Secrets (Vault)**:
- `secret/headscale/oidc_client_secret`
**Ingress**: Standard `ingress_factory` with `protected = false` (OIDC is handled by Headscale itself).
### AdGuard Configuration
**Upstream DNS servers**:
- Cloudflare: `1.1.1.1`, `1.0.0.1`
- Google: `8.8.8.8`, `8.8.4.4`
**Conditional forwarding**:
- `viktorbarzin.lan` → `10.0.20.101` (Technitium)
**Ad-blocking lists**:
- AdGuard DNS filter
- OISD full list
- Developer Dan's ads and tracking list
**Custom rules**: Block telemetry for Windows, macOS, and smart TVs.
### WireGuard (pfSense — Hub)
**Single interface `tun_wg0`** (OPT2) with two peers on subnet `10.3.2.0/24`. Listens on `*:51821` for both IPv4 and IPv6. IPv6 access via HE tunnel (`gif0`, `2001:470:6e:43d::2`) requires a `pass in` pf rule on the `HE_IPv6` interface (interface name `opt3` in config.xml):
**Peer: London Flint 2**:
- WireGuard IP: `10.3.2.6`
- Remote endpoint: `vpn.viktorbarzin.me:51821` (dual-stack: A=176.12.22.76, AAAA=2001:470:6e:43d::2)
- Allowed IPs: `192.168.8.0/24, 192.168.9.0/24, 192.168.10.0/24, 10.3.2.6/32`
- Keepalive: 25 seconds (configured on London side)
**Peer: Valchedrym**:
- WireGuard IP: `10.3.2.5`
- Remote endpoint: `85.130.41.28:51820`
- Allowed IPs: `10.3.2.5/32, 192.168.0.0/24`
- Keepalive: none (should be added)
**Static routes on pfSense**:
- `192.168.0.0/24` → gateway `valchedrym` (10.3.2.5)
- `192.168.8.0/24` → gateway `london_flint_2` (10.3.2.6)
- `192.168.9.0/24` → gateway `london_flint_2` (10.3.2.6)
- `192.168.10.0/24` → gateway `london_flint_2` (10.3.2.6)
**Note**: WireGuard on pfSense is NOT managed by Terraform — configured via pfSense UI/shell.
### WireGuard (London — GL-iNet Flint 2)
- Interface: `wgclient1` (proto `wgclient`, config `peer_855`)
- Local IP: `10.3.2.6/32`
- Remote endpoint: `vpn.viktorbarzin.me:51821` (dual-stack — resolves to IPv4 or IPv6)
- Allowed IPs: `10.0.0.0/8, 192.168.1.0/24, 192.168.0.0/24`
- Keepalive: 25 seconds
- Policy routing: GL-iNet marks traffic via iptables mangle → routing table 1001 (ipset `dst_net10`)
- Persistence: `/etc/firewall.user` injects LOCAL_POLICY mangle rule (GL-iNet's `gl-tertf` creates TUNNEL10_ROUTE_POLICY but not the LOCAL_POLICY rule for router-originated traffic)
**GL-iNet AllowedIPs format**: UCI `list allowed_ips` entries are concatenated by the `wgclient` protocol handler. Use a **single comma-separated entry** (`'10.0.0.0/8,192.168.1.0/24,192.168.0.0/24'`), NOT multiple list entries. Multiple entries cause a parse error like `10.0.0.0/8192.168.1.0/24` (no separator).
**DNS**: AdGuardHome runs on the router. Upstream DNS should NOT include `1.1.1.1` — it creates conntrack conflicts with ICMP and GL-iNet's `carrier-monitor` health check floods Cloudflare, triggering ICMP rate limits. Use `9.9.9.9`, `8.8.4.4` instead. Health check IPs (`glconfig.general.track_ip`) should use `1.0.0.1` not `1.1.1.1`.
### WireGuard (Valchedrym — OpenWRT)
- WireGuard IP: `10.3.2.5`
- Remote endpoint: Sofia public IP
- LAN: `192.168.0.0/24`
### Vault Secrets
- Headscale OIDC client secret: `secret/headscale/oidc_client_secret`
- WireGuard private keys: `secret/pfsense/wg_privkey_london`, `secret/pfsense/wg_privkey_valchedrym`
## Decisions & Rationale
### Why Headscale Instead of Plain WireGuard?
**Alternatives considered**:
1. **WireGuard with static configs**: Requires manual key distribution, complex peer management.
2. **OpenVPN**: Slower, more overhead, less mobile-friendly.
3. **Commercial Tailscale**: SaaS, not self-hosted, less control over data.
**Decision**: Headscale provides:
- **Mesh networking**: Clients connect directly, not through a central server.
- **OIDC authentication**: No pre-shared keys, integrates with existing SSO.
- **Easy onboarding**: Users install official Tailscale app, no custom configs.
- **Self-hosted**: Full control over control plane and data.
**Trade-off**: More complex setup than plain WireGuard, but operational benefits outweigh initial complexity.
### Why Split DNS (AdGuard + Technitium)?
**Alternatives considered**:
1. **Single DNS server (Technitium only)**: Requires forwarding all public domains to upstream, creating single point of failure.
2. **Cloudflare only**: Fast, but no internal `.lan` domain support without zone delegation.
3. **Tailscale MagicDNS only**: Depends on Headscale control plane, fails if control plane is down.
**Decision**: Split DNS architecture provides:
- **Resilience**: If Headscale tunnel fails, public DNS still works via AdGuard → Cloudflare.
- **Ad-blocking**: AdGuard filters ads and malicious domains for all VPN clients.
- **Internal domains**: Technitium authoritatively serves `.lan`, no external dependency.
**Key benefit**: Zero tunnel dependency for public DNS. Users can browse the internet even if the homelab is completely offline.
### Why Embedded DERP Relay?
**Alternatives considered**:
1. **External DERP relays only (Tailscale's public relays)**: Free, but adds latency and exposes traffic metadata to Tailscale.
2. **No DERP, direct connections only**: Fails for symmetric NAT clients (mobile networks).
**Decision**: Embedded DERP (region 999) provides:
- **Privacy**: All relay traffic stays within the homelab.
- **Reliability**: Not dependent on Tailscale's public infrastructure.
- **No extra ports**: DERP uses HTTPS (443), same as Headscale API.
**Trade-off**: Adds CPU/memory overhead to Headscale pod, but minimal compared to benefits.
### Why OIDC Authentication Instead of Pre-Authorized Keys?
**Alternatives considered**:
1. **Pre-authorized keys**: Headscale generates keys, admin shares with users.
2. **Shared secret**: Single password for all users.
**Decision**: OIDC via Authentik provides:
- **Centralized access control**: Add/remove users in one place.
- **Audit trail**: Authentik logs all login attempts.
- **Group-based authorization**: Only "Headscale Users" group can access VPN.
- **SSO integration**: Users already have accounts in Authentik for other services.
**Key workflow**: Admin invites user → user logs in via Authentik → admin approves device → access granted. No key exchange needed.
## Troubleshooting
### Headscale Login Fails (OIDC Error)
**Symptoms**: `tailscale login --login-server` opens browser, but after Authentik login, shows "OIDC error: invalid state".
**Diagnosis**: Check Headscale logs: `kubectl logs -n headscale deploy/headscale`
**Common causes**:
1. **Client clock skew**: OIDC tokens have short validity (5 minutes). Ensure client's system time is accurate.
2. **Callback URL mismatch**: Authentik application must have `https://headscale.viktorbarzin.me/oidc/callback` in Redirect URIs.
3. **Group membership**: User is not in "Headscale Users" group in Authentik.
**Fix**: Sync system clock, verify Authentik application config, add user to group.
### Direct Connection Fails, Traffic Goes via DERP
**Symptoms**: `tailscale status` shows `relay "home"` instead of direct connection. Higher latency.
**Diagnosis**: Check DERP usage: `tailscale netcheck`
**Common causes**:
1. **Symmetric NAT**: Mobile networks or restrictive corporate firewalls block UDP hole-punching.
2. **Firewall blocking WireGuard**: Port 51820 UDP blocked on one or both clients.
3. **STUN failure**: Can't determine external IP and port.
**Fix**: This is expected behavior in many environments. DERP relay ensures connectivity. If latency is unacceptable, use site-to-site WireGuard instead.
### Can't Resolve .lan Domains from VPN
**Symptoms**: `nslookup truenas.viktorbarzin.lan` returns `NXDOMAIN`.
**Diagnosis**: Check DNS chain: Client → AdGuard → Technitium.
**Steps**:
1. Verify AdGuard is running: `kubectl get pod -n adguard`
2. Check AdGuard conditional forwarding: Query AdGuard directly: `nslookup truenas.viktorbarzin.lan `
3. Check Technitium: `nslookup truenas.viktorbarzin.lan 10.0.20.101`
**Common causes**:
1. **AdGuard not forwarding .lan**: Conditional forwarding rule missing or misconfigured.
2. **Technitium down**: Pod crash-looping or PVC corrupted.
3. **DNS propagation delay**: Technitium zone update not yet applied.
**Fix**: Verify conditional forwarding in AdGuard UI. Restart Technitium if needed. Check zone file in Technitium UI.
### VPN Client Can't Reach K8s Services
**Symptoms**: Can `ping 10.0.20.1` (pfSense), but `curl https://immich.viktorbarzin.me` times out.
**Diagnosis**: Check connectivity at each layer:
1. **DNS**: Does `nslookup immich.viktorbarzin.me` return correct IP?
2. **Routing**: Can client reach MetalLB IP? `ping `
3. **Firewall**: Is pfSense blocking traffic from VPN subnet?
**Common causes**:
1. **Split DNS working too well**: Client resolves to Cloudflare IP instead of internal LAN IP. Expected for proxied domains — use direct domain (e.g., `immich-direct.viktorbarzin.me`).
2. **ACL policy**: Headscale ACL blocks client from accessing certain subnets.
3. **pfSense NAT rule missing**: Traffic from VPN subnet not routed to VLAN 20.
**Fix**: For proxied domains, use non-proxied DNS names. Check Headscale ACL policy. Verify pfSense NAT rules.
### DERP Relay Returns 502 Bad Gateway
**Symptoms**: Tailscale clients can't connect, DERP shows offline in `tailscale netcheck`.
**Diagnosis**: Check Headscale ingress: `kubectl get ingress -n headscale`
**Common causes**:
1. **Traefik middleware blocking DERP traffic**: Forward-auth interferes with WebSocket upgrade.
2. **Headscale pod not ready**: Liveness probe failing.
3. **Cloudflared tunnel issue**: DERP uses WebSockets, which require HTTP/1.1 upgrade support.
**Fix**: Ensure Headscale ingress has `protected = false` (no forward-auth). Check Headscale pod readiness. Verify Cloudflared supports WebSocket upgrades.
### WireGuard Site-to-Site Tunnel Disconnects
**Symptoms**: Can't reach services in London from Sofia. `ping 192.168.8.1` fails.
**Diagnosis**: Check pfSense WireGuard status via `pfsense.py wireguard` or Dashboard → VPN → WireGuard → Status
**Common causes**:
1. **AllowedIPs parse error on GL-iNet**: If `wg show wgclient1` shows no peers and interface is DOWN with `qdisc noop`, check `/etc/config/wireguard` peer config. AllowedIPs must be a single comma-separated entry, not multiple `list` entries (see London section above).
2. **IPv6 endpoint resolution**: If IPv4 is down, DNS resolves to IPv6 (AAAA record). Ensure the pfSense `HE_IPv6` (gif0) interface has a `pass in` rule for UDP 51821.
3. **Keepalive packets dropped**: Firewall or ISP blocking UDP 51821.
4. **Public IP changed**: Dynamic IP on remote site changed, config still has old IP.
5. **GL-iNet policy routing lost**: After firewall reload, check if `TUNNEL10_ROUTE_POLICY` and `LOCAL_POLICY` mangle rules exist. If not, run `/etc/init.d/firewall restart` and check `/etc/firewall.user` execution.
6. **Kill switch active**: If WG interface is DOWN, table 1001 only has blackhole routes → all marked traffic dropped → IPv4 internet broken.
**Fix**: Check `wg show wgclient1` on London router. If no peers, fix AllowedIPs format and `ifdown/ifup wgclient1`. Verify handshake with `ping 10.3.2.1`.
## Related
- **Runbooks**:
- `docs/runbooks/add-headscale-user.md`
- `docs/runbooks/reset-derp-relay.md`
- `docs/runbooks/update-wireguard-peer.md`
- **Architecture Docs**:
- `docs/architecture/networking.md` — Core network architecture
- `docs/architecture/dns.md` — Full DNS architecture (coming soon)
- **Reference**:
- `.claude/reference/authentik-state.md` — OIDC application configs
- `.claude/reference/service-catalog.md` — Full service inventory