infra

Author	SHA1	Message	Date
Viktor Barzin	f07f05f9bb	migrate Nextcloud data volume from NFS to iSCSI for fsync support SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to missing fsync guarantees and database locking under concurrent access. iSCSI (ext4) provides proper fsync and block-level I/O. - Replace nfs_volume module with iscsi-truenas PVC (20Gi) - Update Helm chart to use nextcloud-data-iscsi claim - Excluded 12.5GB nextcloud.log and corrupted DB from migration Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-11 23:24:03 +00:00
Viktor Barzin	4427530e65	Archive terraform.tfvars — secrets now in SOPS Removed from git tracking and added to .gitignore. File stays on disk locally for reference. config.tfvars + secrets.auto.tfvars.json are the active var sources. [ci skip]	2026-03-11 21:16:11 +00:00
Viktor Barzin	d7953322dd	fix cluster health: pin actualbudget, spread MySQL, scale grampsweb, fix GPU toleration - Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to prevent recurring migration breakage from rolling nightly builds - Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes, relieving memory pressure on k8s-node4 - Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory) - Add GPU toleration Kyverno policy to Terraform using patchesJson6902 instead of patchStrategicMerge to fix toleration array being overwritten (caused caretta DaemonSet pod to be unable to schedule on k8s-master) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-11 11:43:34 +00:00
Viktor Barzin	6bdcd88d25	set Recreate strategy for plotting-book deployment iSCSI volumes are ReadWriteOnce and cannot multi-attach, so the old pod must terminate before the new one starts.	2026-03-10 23:47:30 +00:00
Viktor Barzin	5a9881337d	Add terminal stack - reverse proxy to ttyd behind authentik Exposes ttyd at 10.0.10.10:7681 via terminal.viktorbarzin.me with Cloudflare DNS and Authentik forward-auth protection.	2026-03-10 23:46:01 +00:00
Viktor Barzin	d8bcdfef2e	revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert - MaxRequestWorkers 25→50: too few workers caused ALL workers to block on SQLite locks, making liveness probes fail even faster (131 restarts vs 50 before). 50 is a compromise — enough workers for probes. - Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue) - MySQL migration attempted but hit: GR error 3100 (fixed with GIPK), emoji in calendar/filecache (stripped), SQLite corruption (pre-existing from crash-looping). Migration rolled back, Nextcloud restored to SQLite.	2026-03-09 22:05:20 +00:00
Viktor Barzin	eed991a27b	exclude nextcloud from HighServiceErrorRate alert Nextcloud has chronic 5xx errors due to SQLite lock contention causing Apache worker exhaustion. Excluding from alert until MySQL migration.	2026-03-09 20:26:30 +00:00
Viktor Barzin	0ca81a6112	fix: mount Apache MPM config under nextcloud.extraVolumes (not top-level) The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested under the nextcloud: key. Also mount to mods-available/ (the actual file) not mods-enabled/ (which is a symlink). Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.	2026-03-08 21:37:39 +00:00
Viktor Barzin	ff03f2b99f	tune Nextcloud Apache/PHP to fix constant crash-looping (50 restarts/6d) Root cause: Apache prefork with 150 MaxRequestWorkers (each ~220MB RSS) on SQLite DB causes worker exhaustion + lock contention → Apache hangs → aggressive liveness probe (3 failures × 10s) kills container. Fixes: - Apache: MaxRequestWorkers 150→25, MaxConnectionsPerChild 0→200, StartServers 5→3 (via ConfigMap mount over mpm_prefork.conf) - PHP: max_execution_time 0→300s, max_input_time 300s (prevent zombie workers) - Liveness probe: period 10s→30s, failureThreshold 3→6, timeout 5s→10s (180s tolerance vs 30s before) - Readiness probe: period 10s→30s, timeout 5s→10s	2026-03-08 21:33:27 +00:00
Viktor Barzin	ad8b90575e	fix noisy JobFailed and duplicate mail server alerts - JobFailed: only alert on jobs started within the last hour, so stale failed CronJob runs don't keep firing after subsequent runs succeed - Mail server alert: renamed to MailServerDown, now targets the specific mailserver deployment instead of all deployments in the namespace (was falsely triggering on roundcubemail going down) - Updated inhibition rule to use new MailServerDown alert name	2026-03-08 21:22:43 +00:00
Viktor Barzin	33c7976630	reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts - NodeDown now suppresses workload and service alerts (PodCrashLooping, DeploymentReplicasMismatch, StatefulSetReplicasMismatch, etc.) - NFSServerUnresponsive suppresses pod-level alerts - Increased for durations on transient alerts (e.g. 15m→30m for replica mismatches) - NodeDown for: 1m→3m to avoid flapping - Removed all 3 Loki log-based alerts (duplicated Prometheus alerts) - Downgraded HeadscaleDown critical→warning, mail server page→warning	2026-03-08 21:13:16 +00:00
Viktor Barzin	2fa8ba2038	[ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern - Document sealed secrets workflow in AGENTS.md and CLAUDE.md - Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference - Users: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform - E2E tested: seal/commit/plan/apply/decrypt cycle verified	2026-03-08 20:03:50 +00:00
Viktor Barzin	6b3e84f465	deploy Sealed Secrets controller for encrypted secret management Adds Sealed Secrets (Bitnami) to the platform stack so cluster users can encrypt secrets with a public key and commit SealedSecret YAMLs to git. The in-cluster controller decrypts them into regular K8s Secrets. - New module: sealed-secrets (namespace + Helm chart v2.18.3, cluster tier) - k8s-portal setup script: adds kubeseal CLI install for Linux and Mac	2026-03-08 19:49:48 +00:00
Viktor Barzin	d352d6e7f8	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	ead33b23dd	enable MySQL InnoDB Cluster auto-recovery after crashes Previously manualStartOnBoot=true and exitStateAction=ABORT_SERVER meant any ungraceful shutdown required manual rebootClusterFromCompleteOutage(). New settings: - group_replication_start_on_boot=ON: auto-start GR after crash - autorejoin_tries=2016: retry rejoining for ~28 minutes - exit_state_action=OFFLINE_MODE: stay alive on expulsion (don't abort) - member_expel_timeout=30s: tolerate brief unresponsiveness - unreachable_majority_timeout=60s: leave group cleanly if majority lost	2026-03-08 17:13:03 +00:00
Viktor Barzin	98f4920af1	[ci skip] remember: update kubelet thresholds when changing node memory	2026-03-08 10:34:17 +00:00
Viktor Barzin	fffc2ed0ab	fix node OOM: reduce memory overcommit ratio and add kubelet eviction thresholds LimitRange defaults had a 4-8x limit/request ratio causing the scheduler to overpack nodes. When pods burst, nodes OOM-thrashed and became unresponsive (k8s-node3 and k8s-node4 both went down today). Changes: - Increase default memory requests across all tiers (ratio now 2x): - core/cluster: 64Mi → 256Mi request (512Mi limit) - gpu: 256Mi → 1Gi request (2Gi limit) - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit) - Add kubelet memory reservation and eviction thresholds: - systemReserved: 512Mi, kubeReserved: 512Mi - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset) - Applied to all nodes and future node template	2026-03-08 10:33:38 +00:00
Viktor Barzin	193f2e2dc5	add iSCSI persistent volume for plotting-book SQLite database Create a 1Gi PVC using iscsi-truenas StorageClass, mount at /data, and set DB_PATH=/data/database.sqlite for persistent storage.	2026-03-07 21:57:22 +00:00
Viktor Barzin	4374e78869	[ci skip] fix Wealthfolio Homepage icon: wealthfolio.png → mdi-finance	2026-03-07 21:32:58 +00:00
Viktor Barzin	9d031290cc	[ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks tandoor.png → tandoor-recipes.png (dashboard-icons), podcast.png → mdi-podcast, networking.png → mdi-lan, goldilocks.png → mdi-scale-balance	2026-03-07 21:29:51 +00:00
Viktor Barzin	32bd30f56e	[ci skip] fix invalid Homepage dashboard icons for 9 services Use correct dashboard-icons names where available (changedetection, gramps-web), Material Design Icons for custom apps (city-guesser, plotting-book, resume, tuya-bridge, trading-bot, poison-fountain), and Simple Icons for F1 Stream.	2026-03-07 21:14:17 +00:00
Viktor Barzin	76a4987eef	[ci skip] add Forgejo task pipeline for OpenClaw AI agent Forgejo issues as a task queue for OpenClaw: - Forgejo OAuth2 with Authentik SSO, self-registration disabled - Webhook-triggered task processing (instant) + CronJob backup (5min poll) - Tasks processed via Mistral Large 3 (NVIDIA NIM API) - Results posted as issue comments, auto-labeled and closed - Comment follow-ups and reopened issues supported - n8n RBAC for OpenClaw pod exec (future workflow integration)	2026-03-07 21:11:07 +00:00
Viktor Barzin	c2765e890b	add nginx caching proxy for Homepage widget API requests Stale-while-revalidate cache in front of Homepage reduces first-paint latency by serving cached /api/ responses instantly while refreshing upstream in background. Non-API paths pass through uncached.	2026-03-07 21:11:07 +00:00
root	b4f9777ecd	Woodpecker CI deploy commit [CI SKIP]	2026-03-07 20:47:22 +00:00
Viktor Barzin	33b20ce111	add Google OAuth env vars to plotting-book deployment Deploy GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and GOOGLE_CALLBACK_URL to the plotting-book container. Update CSP to allow accounts.google.com for connect-src and form-action directives.	2026-03-07 20:41:08 +00:00
Viktor Barzin	650c2a6ed4	[ci skip] add liveness probe to Send deployment Prevents stale Redis connections from silently breaking file uploads. The old Node.js Redis client doesn't auto-reconnect after Redis restarts, causing all files to appear expired.	2026-03-07 20:39:57 +00:00
Viktor Barzin	144fd151cb	[ci skip] fix Navidrome credentials: admin user is wizard not admin	2026-03-07 20:39:56 +00:00
Viktor Barzin	7af0024473	[ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0	2026-03-07 20:39:56 +00:00
Viktor Barzin	3ec643a897	[ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint) Replace wanStatus with temp field. Remove wan interface param since the pfSense REST API v2 package doesn't expose /status/gateway.	2026-03-07 20:39:56 +00:00
Viktor Barzin	f3042f318e	[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains - qBittorrent: use service port 80 (not container port 8080) - Immich: add version=2 for new API endpoints (/api/server/*) - Nextcloud: use external URL (internal rejects untrusted Host header) - HA London: remove widget (token expired, needs manual regeneration) - Headscale: remove widget (requires nodeId param, not overview)	2026-03-07 20:39:56 +00:00
Viktor Barzin	17256c8f76	[ci skip] fix widget URLs: use correct k8s service ports Services expose port 80 via ClusterIP but widgets were using container target ports (5000, 3001, 4533, 3000). Calibre was using external URL through Authentik. All now use correct internal service URLs.	2026-03-07 20:39:56 +00:00
Viktor Barzin	c9bb470259	[ci skip] upgrade Homepage from v1.8.0 to v1.10.1	2026-03-07 20:39:56 +00:00
Viktor Barzin	57eed07370	[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma Add API credentials to SOPS and wire homepage_credentials through stacks. Re-add Uptime Kuma widget with new "infra" status page slug.	2026-03-07 20:39:55 +00:00
Viktor Barzin	7027c49fef	[ci skip] update ha-sofia VM: VMID 103, disk 64G, SSH access info	2026-03-07 20:39:55 +00:00
Viktor Barzin	10acdcd5a2	[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale Wire homepage_credentials through servarr parent stack for prowlarr. Fix paperless-ngx widget to use internal service URL.	2026-03-07 20:39:55 +00:00
Viktor Barzin	1f1700c4ff	[ci skip] fix broken Homepage widgets + add service API tokens to SOPS - Grafana: fix service URL (grafana not monitoring-grafana) - Uptime Kuma: remove widget (no status page configured) - Speedtest/Frigate/Immich: use internal k8s service URLs (external goes through Authentik forward auth, blocking API calls) - pfSense: clean up annotations - SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens	2026-03-07 20:39:55 +00:00
Viktor Barzin	a9daf50142	[ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant Wire homepage_credentials tokens through platform stack to enable live widgets for Authentik, Shlink (URL shortener), and Home Assistant London. Update SOPS with new credential entries.	2026-03-07 20:39:54 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
OpenClaw	cf386e06cd	Update MEMORY.md timestamp	2026-03-07 16:43:15 +00:00
Viktor Barzin	2dc5ab8995	[ci skip] fix false-positive sensitive=true on kube_config_path	2026-03-07 15:48:19 +00:00
Viktor Barzin	7cc7991ce6	[ci skip] claudeception: extract 2 skills from today's session 1. sops-age-secrets-migration: Complete guide for migrating from git-crypt to SOPS+age. Covers JSON format requirement, race condition avoidance, CI integration, complex types, and migration sequence. 2. iterative-plan-review-with-subagents: Design pattern for reviewing plans with parallel security + implementation subagents. 2-3 iterations to zero CRITICALs. Used successfully for the SOPS migration design.	2026-03-07 15:46:36 +00:00
Viktor Barzin	9f2ac0fd1a	[ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline AGENTS.md: added SOPS secrets management section, scripts/tg usage, contributor onboarding steps, pull-through cache bypass notes. CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder, versioned tag guidance for pull-through cache. CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys the k8s portal when files under stacks/platform/modules/k8s-portal/files/ change on master push. Uses buildx for linux/amd64.	2026-03-07 15:37:19 +00:00
Viktor Barzin	b6aacf7b02	[ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag Fixed architecture and services pages to wrap table rows in <thead>/<tbody> as required by Svelte 5's strict HTML validation. E2E test passed: clean Alpine container → setup script → kubectl installed → CA cert verified against API server → TLS SUCCESS	2026-03-07 15:34:32 +00:00
Viktor Barzin	6f8b48a73c	[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages) Bug fixes: - CA cert now populated in ConfigMap (was empty → TLS failures) - Remove useless heredoc quote escaping in setup script - Fix homepage: VPN callout, correct verification command (get namespaces) - Fix false-positive sensitive=true on ingress_path, tls_secret_name, truenas_host, ollama_host, client_certificate_secret_name New pages (direct Svelte, no mdsvex dependency): - /onboarding: step-by-step guide (VPN, kubectl, git, first PR) - /architecture: cluster topology, storage, networking, tiers - /services: catalog of 70+ services with URLs - /contributing: PR workflow, what you can/can't change, NEVER list - /troubleshooting: common issues and fixes Navigation bar added to layout. All pages use consistent docs styling. Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files && docker build -t viktorbarzin/k8s-portal:latest . && docker push	2026-03-07 15:06:26 +00:00
Viktor Barzin	5907e50fda	[ci skip] update ha-london skill: SSH is hassio@192.168.8.103 (HA OS) Old Pi at 192.168.8.104 no longer runs HA. Updated SSH host, user, config path, and platform info to reflect HA OS on 192.168.8.103.	2026-03-07 14:34:44 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	fb1347a130	[ci skip] phase 3: switch terragrunt to load config.tfvars + SOPS secrets terragrunt.hcl now loads: - config.tfvars (required, plaintext) - terraform.tfvars (optional, git-crypt — backward compat) - secrets.auto.tfvars.json (optional, SOPS-decrypted) before_hook checks that at least one secrets source exists. Use `scripts/tg` wrapper for SOPS-based workflow. Old terraform.tfvars kept for reference and backward compatibility.	2026-03-07 14:16:28 +00:00
Viktor Barzin	0d8e3484be	[ci skip] phase 2: split terraform.tfvars into config.tfvars + secrets.sops.json config.tfvars (29 vars, plaintext): hostnames, IPs, DNS records, IDs secrets.sops.json (140 vars, SOPS-encrypted): passwords, tokens, keys, maps Both files coexist with terraform.tfvars — no functional change yet. Complex types preserved: maps (mailserver_accounts, k8s_users, homepage_credentials), lists (xray_reality_clients), heredocs as \n-escaped JSON strings (SSH keys, WireGuard conf, headscale config).	2026-03-07 14:04:40 +00:00
Viktor Barzin	39333033a6	[ci skip] phase 1: SOPS tooling setup (.sops.yaml, scripts/tg, .gitignore) Part of SOPS multi-user secrets migration. - .sops.yaml: defines age recipients (Viktor + CI) - scripts/tg: wrapper that decrypts secrets before running terragrunt - .gitignore: excludes decrypted secrets.auto.tfvars.json No functional change — terraform.tfvars still works as before.	2026-03-07 13:57:42 +00:00
Viktor Barzin	91d11e5cda	[ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x) Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption. Operators push PRs → Viktor reviews → CI applies. No encryption keys needed for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations (0 remaining CRITICALs).	2026-03-07 13:55:05 +00:00
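
Commit ff03f2b99f quotes a liveness tolerance of "180s vs 30s before". A minimal sketch of that arithmetic, assuming tolerance is simply the probe period multiplied by the failure threshold (the function name is hypothetical; the numbers are taken from the commit message):

```python
def probe_tolerance_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a container may stay unhealthy before being restarted.

    Assumption: tolerance = period * failureThreshold, ignoring the probe
    timeout and any initial delay.
    """
    return period_seconds * failure_threshold


# Values from commit ff03f2b99f: period 10s -> 30s, failureThreshold 3 -> 6.
before = probe_tolerance_seconds(period_seconds=10, failure_threshold=3)
after = probe_tolerance_seconds(period_seconds=30, failure_threshold=6)

print(f"before: {before}s, after: {after}s")
```

Under this simplification the result matches the figures in the commit message.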


1585 commits
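
Commit fffc2ed0ab describes the LimitRange limit/request ratio dropping from 4-8x to 2x. The sketch below reproduces those ratios from the request and limit values listed in the commit message. It assumes the limits themselves did not change (the message only gives the old and new requests), and the dictionary layout and function name are illustrative only:

```python
# Per tier: (old request, new request, limit), all in Mi.
# Assumption: limits were the same before and after the change.
TIERS = {
    "core/cluster": (64, 256, 512),
    "gpu": (256, 1024, 2048),
    "edge/aux/fallback": (64, 128, 256),
}


def overcommit_ratio(request_mi: int, limit_mi: int) -> float:
    """Limit divided by request: how far a pod may burst past what was scheduled."""
    return limit_mi / request_mi


old_ratios = {tier: overcommit_ratio(old, limit) for tier, (old, _new, limit) in TIERS.items()}
new_ratios = {tier: overcommit_ratio(new, limit) for tier, (_old, new, limit) in TIERS.items()}

for tier in TIERS:
    print(f"{tier}: {old_ratios[tier]:.0f}x -> {new_ratios[tier]:.0f}x")
```

A higher ratio lets the scheduler pack more pods onto a node than the node can sustain when they burst, which is the failure mode the commit describes.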