Commit graph

6 commits

Author SHA1 Message Date
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
3e9231ae0d feat: augment outage report template with debugging context
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
  Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
  Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
  when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:03:44 +00:00
Viktor Barzin
460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
82b0f6c4cb truenas deprecation: migrate all non-immich storage to proxmox NFS
- Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127)
  (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book)
- Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS
- Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox
- Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks
- Delete stacks/platform/modules/ (27 dead module copies, 65MB)
- Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127)
- Remove iscsi DNS record from config.tfvars
- Fix woodpecker persistence config and alertmanager PV

Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.
2026-04-12 14:35:39 +01:00