Commit graph

1911 commits

Author SHA1 Message Date
Viktor Barzin
2e8fbb51bf state(headscale): update encrypted state 2026-03-23 02:37:22 +02:00
Viktor Barzin
6bfb5e3285 state(headscale): update encrypted state 2026-03-23 02:29:03 +02:00
Viktor Barzin
a1b4a2106d add NFS/CSI cascade failure post-mortem [ci skip] 2026-03-23 02:25:11 +02:00
Viktor Barzin
e9311915cb add agent route to k8s-portal 2026-03-23 02:25:08 +02:00
Viktor Barzin
6bfade3013 update infra stack terraform lock file (helm/kubernetes/vault providers) 2026-03-23 02:24:47 +02:00
Viktor Barzin
2dcdc65db5 add weekly SQLite backup for plotting-book to NFS 2026-03-23 02:24:43 +02:00
Viktor Barzin
e4cf0dee83 add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout
- New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway
- Increase Prometheus Helm timeout to 900s for slow iSCSI reattach
2026-03-23 02:24:39 +02:00
Viktor Barzin
e463281205 optimize backup schedules: compress dumps, stagger to weekly, extend retention
- dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed
- infra-maintenance: etcd backup daily→weekly Sunday 1am
- redis: backup hourly→weekly Sunday 3am, retention 7→28 days
- vault: raft backup daily→weekly Sunday 2am
2026-03-23 02:24:34 +02:00
Viktor Barzin
6e661fdfc5 add backup & DR strategy documentation with ASCII diagrams
Covers all 3 protection layers (ZFS snapshots, app-level backups,
offsite sync), the hybrid cloud sync architecture, iSCSI hardening,
monitoring alerts, and service protection matrix.
2026-03-23 02:24:02 +02:00
Viktor Barzin
644562454c add IPv6 connectivity via Hurricane Electric 6in4 tunnel
- Add public_ipv6 variable and AAAA records for all 34 non-proxied services
- Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel)
- Update SPF record with current IPv4/IPv6 addresses
- Add AAAA update support to Technitium DNS updater CLI
- Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT
- pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6),
  socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules
2026-03-23 02:22:00 +02:00
Viktor Barzin
813f523170 docs: add private registry usage to infra CLAUDE.md [ci skip] 2026-03-23 01:08:57 +02:00
Viktor Barzin
1f4e8cb278 use registry.viktorbarzin.me hostname for private images + protect ingress
- Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me
- Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template
  (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip)
- Enable Authentik protection on priority-pass ingress
2026-03-23 01:02:27 +02:00
Viktor Barzin
e9919d8fc9 fix priority-pass: bump backend memory to 512Mi (OOM with OpenCV) 2026-03-23 00:58:39 +02:00
Viktor Barzin
0674d6e538 deploy priority-pass app to cluster via private registry
- SvelteKit frontend + FastAPI backend in single pod with sidecar pattern
- Images pushed to 10.0.20.10:5050 private registry (v4/v1)
- SvelteKit server route proxies /api/transform to backend on 127.0.0.1:8000
- Exposed at priority-pass.viktorbarzin.me (Cloudflare-proxied, no auth)
- Uses imagePullSecrets for authenticated registry pulls
2026-03-23 00:55:41 +02:00
Viktor Barzin
d78be951b3 state(vaultwarden): update encrypted state 2026-03-23 00:51:56 +02:00
Viktor Barzin
311ff5dd9e add hourly SQLite integrity check for vaultwarden with Prometheus alerting
- New CronJob runs PRAGMA integrity_check every hour
- Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway
- VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical)
- VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning)
- Prevents running for days on a corrupted DB unnoticed
2026-03-23 00:50:15 +02:00
Viktor Barzin
3b89a7d7e4 add VaultwardenDown alert and tighten backup staleness threshold
- Add dedicated VaultwardenDown Prometheus alert (critical, 5m)
- Reduce backup staleness threshold from 8d to 24h to match 6h schedule
- Fixes monitoring gap where VW downtime went undetected
2026-03-23 00:47:00 +02:00
Viktor Barzin
a44f35bcf8 harden vaultwarden iSCSI storage and increase backup frequency
- Increase backup from daily to every 6 hours (0 */6 * * *)
- Add pre/post-flight SQLite integrity checks to backup job
- Harden iSCSI on all nodes: increase recovery timeout (300s),
  enable CRC32C data/header digests for bit-flip detection
- Fix restore runbook PVC name (vaultwarden-data-iscsi)

Motivated by SQLite corruption from iSCSI I/O errors.
2026-03-23 00:36:11 +02:00
Viktor Barzin
469fcb12b5 remove duplicate deploy-app skill, now global agent [ci skip] 2026-03-23 00:17:53 +02:00
Viktor Barzin
ab7e18c07c fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify
- Grant kyverno-admission-controller and kyverno-background-controller
  permissions to manage Secrets (required for generate clone rules)
- Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true
  (wildcard cert doesn't cover IP SANs) — applied to all nodes + template
2026-03-22 23:47:29 +02:00
Viktor Barzin
c111799831 remove duplicated agents, update CLAUDE.md references [ci skip]
All agents now live globally in ~/.claude/agents/ (shared via dotfiles).
Deleted 11 duplicates, moved sev-*/deploy-app to global scope.
2026-03-22 23:44:27 +02:00
Viktor Barzin
36171bcda4 add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
2026-03-22 22:10:10 +02:00
Viktor Barzin
e4f478b490 switch claude-memory server to multi-user API_KEYS auth
Enables isolated memory namespaces per user (wizard/emo) by switching
from single API_KEY to API_KEYS JSON map env var.
2026-03-22 20:08:07 +02:00
Viktor Barzin
a53b9438eb state(claude-memory): update encrypted state 2026-03-22 20:01:00 +02:00
Viktor Barzin
29aa6c95f0 state(redis): update encrypted state 2026-03-22 15:23:58 +02:00
Viktor Barzin
c103a1ee05 fix OOMKilled containers: bump immich/actualbudget memory, disable changedetection, cap clickhouse
- immich-server: 512Mi/1Gi → 1700Mi/1700Mi (VPA upperBound 1.39Gi, 34 OOM restarts)
- actualbudget http-api: 384Mi → 768Mi (VPA upperBound 615Mi, 3 OOM restarts)
- changedetection: replicas 1 → 0 (chronic OOM at 64Mi, not worth memory cost)
- rybbit clickhouse: add ConfigMap capping max_server_memory_usage to 800Mi (within 1Gi limit)
2026-03-22 15:22:29 +02:00
Viktor Barzin
3130a5f9e0 state(immich): update encrypted state 2026-03-22 15:18:25 +02:00
Viktor Barzin
894b8a2be8 state(vaultwarden): update encrypted state 2026-03-22 15:18:14 +02:00
Viktor Barzin
5e6c2de849 state(vaultwarden): update encrypted state 2026-03-22 15:18:10 +02:00
Viktor Barzin
7434b58ba6 state(infra-maintenance): update encrypted state 2026-03-22 15:18:05 +02:00
Viktor Barzin
ab95e0ab2f state(vault): update encrypted state 2026-03-22 15:18:03 +02:00
Viktor Barzin
432b3d0a60 state(rybbit): update encrypted state 2026-03-22 15:08:44 +02:00
Viktor Barzin
332fe30a19 state(actualbudget): update encrypted state 2026-03-22 15:08:08 +02:00
Viktor Barzin
250f8c4469 state(changedetection): update encrypted state 2026-03-22 15:07:58 +02:00
Viktor Barzin
ad689076d8 scale down non-critical services to free cluster memory
- authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1
- tuya-bridge: 3→1
- realestate-crawler-api: 2→1
- claude-memory: 2→1
- grafana: 2→1 (config only, apply pending)
- alertmanager: 2→1 (config only, apply pending)

Estimated savings: ~1.2 Gi total
2026-03-22 03:10:12 +02:00
Viktor Barzin
4cf147974e state(nextcloud): update encrypted state 2026-03-22 03:08:00 +02:00
Viktor Barzin
53857a9a87 state(authentik): update encrypted state 2026-03-22 03:07:51 +02:00
Viktor Barzin
6f37fb45bd state(real-estate-crawler): update encrypted state 2026-03-22 03:07:07 +02:00
Viktor Barzin
e2d9b97b00 state(claude-memory): update encrypted state 2026-03-22 03:06:50 +02:00
Viktor Barzin
2ac14bbf87 state(tuya-bridge): update encrypted state 2026-03-22 03:05:46 +02:00
Viktor Barzin
bd98b84ded scale grafana and alertmanager to 1 replica to free cluster memory
Grafana: 2 → 1 (saves ~312 Mi)
Alertmanager: 2 → 1 (saves ~150 Mi)
Matrix already scaled to 0 (saves ~212 Mi)
2026-03-22 03:02:17 +02:00
Viktor Barzin
1c13af142d sync regenerated providers.tf + upstream changes
- Terragrunt-regenerated providers.tf across stacks (vault_root_token
  variable removed from root generate block)
- Upstream monitoring/openclaw/CLAUDE.md changes from rebase
2026-03-22 02:56:04 +02:00
Viktor Barzin
1bf8676a6d state(platform): update encrypted state 2026-03-22 02:52:48 +02:00
Viktor Barzin
2e016d7df2 fix nextcloud db-username + k8s-dashboard chart repo
- nextcloud: add db-username to ESO secret template and usernameKey
  to chart values (required by newer chart version)
- k8s-dashboard: update chart repo URL to kubernetes-retired.github.io
  (old kubernetes.github.io/dashboard returns 404)
2026-03-22 02:50:48 +02:00
Viktor Barzin
e7433c17fb state(trading-bot): update encrypted state 2026-03-22 02:50:47 +02:00
Viktor Barzin
d215446455 state(k8s-dashboard): update encrypted state 2026-03-22 02:50:47 +02:00
root
e30a819592 Woodpecker CI Update TLS Certificates Commit 2026-03-22 00:33:29 +00:00
Viktor Barzin
3d22599f7f fix infra stack: use overwrite strategy for provider generation
The child generate block needs if_exists="overwrite" to properly
override the root terragrunt's k8s_providers block.
2026-03-22 01:28:25 +02:00
Viktor Barzin
2cddcafbe0 state(infra): update encrypted state 2026-03-22 01:28:07 +02:00
Viktor Barzin
728fbcd3bd fix infra stack: add vault provider to terragrunt generate block
The infra stack's provider override only included proxmox but main.tf
uses data "vault_kv_secret_v2" which requires the vault provider.
2026-03-22 01:17:00 +02:00