## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete rewrite of the user-facing documentation:
- How to report outages and request features
- Mermaid flow diagrams for both incident and feature request paths
- SLA expectations (automated vs human response times)
- Self-service checks before reporting
- Severity level definitions
- Status page explanation
- Full technical architecture section with component inventory
- Safety guardrails, labels, and commit conventions
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inbound:
- Direct MX to mail.viktorbarzin.me (ForwardEmail relay attempted and abandoned)
- Dedicated MetalLB IP 10.0.20.202 with ETP: Local for CrowdSec real-IP detection
- Removed Cloudflare Email Routing (can't store-and-forward)
- Fixed dual SPF violation, hardened to -all
- Added MTA-STS, TLSRPT, imported Rspamd DKIM into Terraform
- Removed dead BIND zones from config.tfvars (199 lines)
Outbound:
- Migrated from Mailgun (100/day) to Brevo (300/day free)
- Added Brevo DKIM CNAMEs and verification TXT
Monitoring:
- Probe frequency: 30m → 20m, alert thresholds adjusted to 60m
- Enabled Dovecot exporter scraping (port 9166)
- Added external SMTP monitor on public IP
Documentation:
- New docs/architecture/mailserver.md with full architecture
- New docs/architecture/mailserver-visual.html visualization
- Updated monitoring.md, CLAUDE.md, historical plan docs
Document fixes from 2026-04-10 London network debugging session:
- pfSense WG now dual-stack (IPv4+IPv6 via HE tunnel gif0 pf rule)
- GL-iNet AllowedIPs must be single comma-separated UCI entry (parse bug)
- AdGuardHome/carrier-monitor must not use 1.1.1.1 (conntrack + rate limit)
- Expanded troubleshooting for site-to-site tunnel disconnects
- Technitium DNS now at dedicated MetalLB IP 10.0.20.201 (was shared 10.0.20.200)
- Document LAN DNS path: pfSense NAT redirect preserves client IPs for Technitium logging
- Document pfSense dnsmasq role (K8s VLAN + localhost only, not WAN)
- Document pfSense aliases (technitium_dns, k8s_shared_lb) for NAT rule maintainability
- Update MetalLB table with per-service IP assignments
- Add ClusterIP (10.96.0.53) for CoreDNS internal forwarding
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- vpn.md: Rewrite WireGuard section to match actual config (single tun_wg0
interface, 10.3.2.0/24 subnet, hub-and-spoke topology, correct device
names and subnets for London/Valchedrym)
- authentik-state.md: Document brute-force-protection policy unbinding fix
that was blocking all unauthenticated users from login flows
[ci skip]
Query logs stopped syncing on 2026-03-16 due to password mismatch after
MySQL cluster rebuild and Technitium app config reset.
- Add Vault static role mysql-technitium (7-day rotation)
- Add ExternalSecret for technitium-db-creds in technitium namespace
- Add password-sync CronJob (6h) to push rotated password to Technitium API
- Update Grafana datasource to use ESO-managed password
- Remove stale technitium_db_password variable (replaced by ESO)
- Update databases.md and restore-mysql.md runbook
- Healthcheck: add entity availability, integration health, automation
status, and system resources checks for Home Assistant Sofia
- Docs: add backup-dr architecture documentation
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
Expanded cloud sync excludes to reduce sync time and Synology disk usage.
All excluded data is either regenerable or low-value.
TrueNAS Task 1 and incremental script already updated live.