infra

Author	SHA1	Message	Date
Viktor Barzin	3027ab85a8	recruiter-responder: bump image_tag to 189ef901 OpenClaw can now answer 'what do we know about <company>?' from cache via the new recruiter_company_research tool, and recruiter_get embeds the cached research payload inline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:49 +00:00
Viktor Barzin	be3b94da85	keel: pin chart 1.0.6 → 1.2.0 (1.0.6 doesn't exist) The Helm repo at https://charts.keel.sh has versions 1.0.0–1.0.5, 1.1.0, 1.2.0. 1.0.6 is not published, so the Phase 0 apply failed silently. Bump to 1.2.0 (app version 0.21.1, latest stable).	2026-05-22 14:16:48 +00:00
Viktor Barzin	411524a10d	kured: drop Mon-Fri restriction, reboot any day The weekday-only schedule was a 2026-03-16-incident-era guardrail when the rest of the safety net was thin. Today's gates — halt-on-alert, sentinel-gate Check 4 (24h soak via node Ready transitions), the K8sUpgradeStalled alert, drainTimeout=30m, concurrency=1, and the sentinel-path fix from earlier today — make weekend reboots safe and just clear the backlog faster. Effect: 5 pending node reboots clear in 5 calendar days instead of queueing up over weekends. The K8s version-upgrade detection at Sun 12:00 UTC self-defers if a Sunday-morning kured reboot fires (the RecentNodeReboot alert is in the Upgrade Gates ignore-less list for the version-upgrade preflight — same mechanism kured uses). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	2e52583abd	Phase 1a: enroll 4 self-hosted services in Keel auto-update Enrolls the cleanest Woodpecker-build-only self-hosted services into the inject-keel-annotations ClusterPolicy by labeling their namespaces keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on each, so Keel will detect the current upstream digest and trigger a rolling restart when polling starts (1h cadence). Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress the annotation drift Kyverno will inject (keel.sh/policy, /trigger, /pollSchedule). Services included: - fire-planner - job-hunter - payslip-ingest - recruiter-responder Skipped from Phase 1 for follow-up: - claude-agent-service (user has WIP on main.tf) - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs) - kms (two Deployments; needs per-resource review) - wealthfolio (sync sidecar pattern; needs review) - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label) - GHA-migrated repos (10) (need per-repo CI cleanup) - beadboard, freedify (no CI) See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	5acfab5bb9	recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	e5a65c11a9	recruiter-triage v3: Perks & Office Life section + cache-first deep_research - claude-agent-service bumped to f764fef6 (agent system prompt adds the Perks block: food/health/pension/equity/PTO/parental/equipment/ learning/wellness/amenities/commuter). 1200-word cap. - recruiter-responder bumped to 38a2cdaa (cache-first deep_research: serves cached payload if fetched_at + ttl_seconds > now; cache writes upsert; new force flag bypasses). Verified end-to-end: deep_research on Datadog now returns full Perks section (~220s, $0.60, 23 turns). Earlier 500 fixed (was uq_research_company_tier dup-key on re-run). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	020f62555b	Phase 0: install Keel + Kyverno auto-update annotation injector Foundation for opt-out-pure auto-update model per docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md. - New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6). Polls registries hourly per design decision #8. Default schedule overridable per-workload via keel.sh/pollSchedule annotation. - New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments, StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true` with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h. - Phase 0 enrolls no namespaces. Phase 1 (next session) labels the self-hosted set. - Per-workload opt-out: label `keel.sh/policy: never` (used by rollback runbook and chrome-service-style deliberate pins). - Keel namespace excluded from the mutate — supervisor self-update has too-bad a failure mode (decision #11). - AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the ignore_changes block enrolled workloads need. - .claude/CLAUDE.md: docker-images rule flagged as transitional. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	9476649539	docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16) Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart derived hostPath from configuration.rebootSentinel) and the companion work to harden the rolling-reboot pipeline against single-replica PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source drift audit (benign). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	3ef860b2be	kured + cnpg: drain-safe defaults ahead of Monday reboot wave Three defensive moves to make the kured rolling-reboot cycle survive edge cases without operator intervention: kured (stacks/kured/main.tf): - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if a future PDB or finalizer stalls drain, kured retries forever and the node stays cordoned silently. 30m caps the silent-failure window — after timeout kured logs the abort and waits for the next period; the node stays Schedulable so cluster capacity isn't lost. Lets us fail closed instead of fail-silent. CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf): - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the failover during a primary-node drain depended on the lone replica being caught up; a WAL backlog would stall the drain until the replica was current. With 3 instances CNPG always has at least one fully-current replica to promote, and the PDB's `minAvailable=1` on the primary selector is satisfied throughout the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about 35Gi after autoresize). Memory: +3Gi pod limit. - Updated the `triggers.instances` so the null_resource's local-exec actually re-applies the YAML (kubectl apply with the new spec). The YAML is the source-of-truth but the trigger is what tells terraform to re-run the provisioner. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:48 +00:00
Viktor Barzin	4ff3638065	state(dbaas): update encrypted state	2026-05-22 14:16:48 +00:00
Viktor Barzin	08bf5e47b7	state(dbaas): update encrypted state	2026-05-22 14:16:48 +00:00
Viktor Barzin	5768216d0e	anubis: HA with shared valkey/redis store + replicas=2 Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge state lived in process memory — a challenge issued by pod A wouldn't be verifiable by pod B (HTTP 500 "store: key not found"). The PDB at `minAvailable=1` made this worse: with replicas=1 the eviction API can NEVER satisfy the constraint, so every drain on a node hosting an Anubis pod looped forever. This is what stalled the manual K8s upgrade on 2026-05-11 (had to delete pods directly to bypass eviction) and was about to block kured on Monday 2026-05-18 once the kured sentinel fix landed. Anubis upstream has first-class support for a Valkey/Redis-protocol shared store (documented as the "Kubernetes worker pool" pattern). Wire it up: - modules/kubernetes/anubis_instance: add `shared_store_url` variable. When set, appends a `store: { backend: valkey, parameters: { url } }` block to the rendered policy YAML and defaults replicas to 2 (capped at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so drains can take down one pod at a time. topologySpreadConstraint tightened to `DoNotSchedule` so the two replicas land on different nodes — a single node loss never takes a whole Anubis instance down. - All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog, travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster Redis already runs HA via Sentinel + haproxy, no new infra needed. Verified: every Anubis Deployment now 2/2 Ready with pods on different nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated by live traffic post-apply; Palo Alto Networks scanner hit blog right after apply and the challenge log shows the new state path. Drain on any worker now succeeds without a `predrain_unstick` workaround — eviction API is satisfied because at most one pod is unavailable at a time, and the other replica keeps serving. Monday's kured reboot wave should roll through cleanly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	3025879478	claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl - main.tf: bump image_tag to 1b3350c0 (carries the new agent), init container also copies recruiter-triage.md into /home/agent/.claude/agents/. - terragrunt.hcl: restored (file was missing — apply was blocked). Standard root include + platform/vault/external-secrets dependencies. Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42) via recruiter-responder REST API → 102.5s, $0.43, structured markdown report with comp bands vs £600k floor, culture signals, remote policy, recent news, sources cited. End-to-end Tier-2 is live. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ea2342b8e2	docs: add CONTEXT.md domain glossary [ci skip] Adds the per-repo domain glossary that engineering skills (diagnose, tdd, improve-codebase-architecture, grill-with-docs) read before working in this repo. Terms only — no implementation detail. Six clusters (code organization, cluster, networking, storage, secrets, CI/CD), 22 terms, plus relationships, an example dialogue, and five flagged ambiguities. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	ce5f3ec209	recruiter-responder: expose Gmail IMAP creds for backtest CLI Pulls vbarzin@gmail.com app password from secret/recruiter-responder (seeded from secret/wealthfolio.imap_password — same Gmail credential that wealthfolio uses for broker-statement ingestion). Env vars GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'. Backtest verified 2026-05-16 against folder 'companies-I-dont-take-seriously': 20/20 recruiter, 100% company extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp, avg 12s latency. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	065982d978	kured: fix sentinel path mismatch that stalled rolling reboots The kured Helm chart derives the sentinel hostPath from `dirname(configuration.rebootSentinel)`. Previously rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at `/sentinel/` (an empty auto-created directory on every host) while the kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required. Two different host directories → kured never saw the open gate, even though the gate's checks were all green every 5 min on every node. Result: unattended-upgrades has packages waiting on every node since 2026-05-10 (when uu was re-enabled) and kured's hourly log says "Reboot not required" for the entire period. Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts hostPath /var/run — same directory the gate writes to. The in-pod mountPath (/sentinel) is hardcoded by the chart and doesn't matter, the symlink chain works out: /sentinel/<file> inside the pod resolves to /var/run/<file> on the host. Verified: kured pod can now list /sentinel/gated-reboot-required (0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15). First gated reboot will land Mon 2026-05-18 02:00 London. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	80e6314bf0	recruiter-responder: bump image_tag to 559e5c57 PDF extraction, tech_stack list, aggressive company/comp inference, no-phone-call drafts, backtest CLI. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	8e11caff8d	recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	391c002f9a	service-catalog: add aiostreams entry Stremio stream aggregator now has its own row in the Active Use tier. Captures the auth model (own UUID+password, not Authentik), monitoring posture (canary probe + 3 alerts), and backup pipeline (weekly NFS dumps of both decrypted config and the Stremio account addon collection). Follow-up from the 2026-05-15/16 hardening session: 5 commits on servarr/aiostreams, none previously catalogued.	2026-05-22 14:16:47 +00:00
Viktor Barzin	24ce3e267d	aiostreams: weekly backup of Stremio account addon collection Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from the AIOStreams config-backup at 03:00): - Logs into api.strem.io with credentials from Vault (secret/viktor.stremio_email + stremio_password, now also synced into the aiostreams-probe-secrets ExternalSecret) - Fetches the full addonCollection via addonCollectionGet - Writes timestamped JSON to the existing aiostreams-backup PVC (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600) - 90-day retention, logs out to invalidate the auth key - Pushgateway metrics: stremio_account_backup_{success,bytes, addon_count,duration_seconds,last_run_timestamp} Protects against: accidental "uninstall all" / API regression / wrong account login wiping the curated set of 22 addons (Cinemeta + 16 MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local). Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.	2026-05-22 14:16:47 +00:00
Viktor Barzin	aa6e9b0242	recruiter-responder: public /cb ingress for Telegram URL-button callbacks - Add ingress_factory module (auth=none, HMAC + expiry are the gate); ingress_path=["/cb"] only — /api stays internal, /healthz cluster. dns_type=proxied. anti_ai_scraping=false. - Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret` auto-clones the wildcard cert into every namespace. - Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS hostname relax). - Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me. - Drop git-crypt-encrypted wildcard cert files into stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new .gitleaksignore — git-crypt encrypts at rest but the working-tree copy is plaintext, so gitleaks can't tell. Smoke-tested end-to-end 2026-05-15 23:45: synthetic email -> Telegram with ✅/❌ buttons -> ✅ tapped via curl -> 'Sent' HTML page -> thread.status=sent, decision row recorded with decided_via=telegram_button, outbound message threaded correctly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	77010b769a	aiostreams: whitelist Vidhin + Tamtaro sync URLs Adds two env vars on the AIOStreams deployment: - WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex (TRaSH-aligned) so syncedRankedRegexUrls works for the user - WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions + Tamtaro's ISE/PSE/ESE-standard Gotcha: AIOStreams validates each synced* field against the matching whitelist — stream-expression files (incl. Vidhin's expressions.json) go in WHITELISTED_SEL_URLS, not the regex one, even though they live in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG. User config: enabled Vidhin's regex + ranked expressions + Tamtaro's ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering; can be added later from the same whitelist.	2026-05-22 14:16:47 +00:00
Viktor Barzin	c396092c86	aiostreams: weekly NFS backup of decrypted user config Adds aiostreams-config-backup CronJob (Sun 03:00 weekly): - Pulls /api/v1/user via internal ClusterIP with UUID + password from the existing aiostreams-probe-secrets ExternalSecret - Writes timestamped JSON to nfs-backup PVC mounted at /backup - 90-day retention, prunes older files - Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp} NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite to Synology via the existing offsite-sync-backup CronJob). Complements the daily postgresql-backup-per-db pipeline (which dumps the encrypted blob) by storing the decrypted JSON — usable for human inspection / disaster recovery even without the AIOStreams password. Verified: manual job wrote 12931 bytes, file present on NFS.	2026-05-22 14:16:47 +00:00
root	1177a82452	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:47 +00:00
Viktor Barzin	a98b00324d	recruiter-responder: pin image tag + run plugin installer init as root - stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3 (300s LLM timeouts + IMAP BODY.PEEK[] fix). - stacks/openclaw/main.tf: install-recruiter-plugin init container now runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the recruiter-responder image otherwise drops to uid 10001 which can't write or chown. Smoke-tested end-to-end 2026-05-15 ~23:15: Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage (12.1s, JSON output complete with company/role/salary/location/tech) -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK. Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from the n8n Postgres workflow_entity table. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:47 +00:00
Viktor Barzin	a72590db7d	recruiter-responder: vault DB role + switch proactive push to Telegram - stacks/vault/main.tf: register pg-recruiter-responder static role on the postgresql connection (7d password rotation). Adds the role to allowed_roles and creates vault_database_secret_backend_static_role for `recruiter_responder` user. - stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID. Updated header doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	89e9471e87	state(vault): update encrypted state	2026-05-22 14:16:46 +00:00
Viktor Barzin	7e1580ba8c	recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount Three coupled changes for the new recruiter-responder pipeline: 1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the download Job script + cmd renderer to handle text_only=true (skip mmproj download + --mmproj flag). The 3 existing vision models stay on text_only=false; no behaviour change for them. 2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets (app secrets from secret/recruiter-responder, DB creds from Vault DB engine static-creds/pg-recruiter-responder), Deployment (replicas=1, Recreate -- IMAP IDLE + APScheduler want single leader), Service ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder. 3. stacks/openclaw/: add init container `install-recruiter-plugin` that uses the recruiter-responder image to copy the .mjs plugin into /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin version to the recruiter-responder image tag. Also injects RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token from openclaw-secrets.recruiter_responder_bearer_token, optional). Pre-apply checklist for recruiter-responder stack: - Vault: seed secret/recruiter-responder with webhook_bearer_token, imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token, task_webhook_token. - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as above webhook_bearer_token). - dbaas: create DB recruiter_responder + role recruiter_responder, and Vault DB-engine role static-creds/pg-recruiter-responder. - Build + push image via Woodpecker (recruiter-responder repo CI). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:46 +00:00
Viktor Barzin	95b9f7bc89	aiostreams: 1h stream cache + canary stream-count probe + 3 alerts Hardening pass following the empty-stream-list incident: 1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 / disabled). Default behaviour hit all 5 upstream addons on every Stremio request; with a 1h TTL repeat requests for the same title are instant, while RD cache invalidations still propagate quickly. 2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's encryptedPassword via the internal ClusterIP, runs a canary stream search for Breaking Bad S01E01, pushes streams_count + probe_success to Pushgateway. Uses an ExternalSecret pulling UUID + password from Vault secret/viktor. Same pattern as email-roundtrip-monitor. 3. Three alerts in monitoring's prometheus_chart_values.tpl: - AIOStreamsStreamCountLow (< 50 streams for 30m) - AIOStreamsProbeFailing (probe_success == 0 for 30m) - AIOStreamsProbeStale (last_run_timestamp > 30min for 10m) Verified: probe returned streams=411 success=1 on first run; all 3 alerts loaded into Prometheus with state=inactive health=ok.	2026-05-22 14:16:46 +00:00
root	fba5ee2df4	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:46 +00:00
Viktor Barzin	c73234982f	aiostreams: pin nightly + switch to auth=app - Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid stale-pull cache, matches 8-char SHA convention for rolling tags) - Switch ingress auth tier required → app: Authentik forward-auth blocks Stremio clients (cannot follow OAuth 302), and AIOStreams already enforces UUID + password on /configure and /api/*, with Stremio addon URLs using encryptedPassword as a bearer token. Result: empty-stream-list issue fixed for public Stremio clients. Verified: 410 streams returned via public URL for Breaking Bad S01E01 with no cookies, vs 0 before (502→Authentik OIDC redirect).	2026-05-22 14:16:46 +00:00
Viktor Barzin	2903ab9778	monitoring(wealth): move Positions table under contrib/growth row Positions panel now sits at y=32 (immediately below the contrib-vs-market + growth row at y=22..32), and everything from the per-account stack down shifts 8 rows lower.	2026-05-22 14:16:46 +00:00
Viktor Barzin	8461275308	wealth: positions table panel (shares + cost basis + unrealised return) pg-sync sidecar now mirrors three extra views from the wealthfolio SQLite: assets (id/symbol/name/currency), quote_latest (one row per asset, preferring YAHOO over MANUAL on same-day collisions), and positions_latest (currently-held positions extracted from the TOTAL aggregate row of holdings_snapshots — quantity, average cost, total cost basis). Wealth dashboard gets a new bottom Positions table joining the three: symbol, name, shares, avg cost, last price, market value, cost, gain, return %. Gain and return % are color-text with red<0, green>=0 thresholds.	2026-05-22 14:16:46 +00:00
Viktor Barzin	d6049ff7a0	terminal: extract app code to viktor/terminal-lobby on Forgejo The lobby has grown enough (frontend, two Go services, devvm units + scripts + config) that it earns its own repo. Code now lives at https://forgejo.viktorbarzin.me/viktor/terminal-lobby with scripts/deploy.sh covering the manual deploy until CI activation lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions not yet enabled). This stack now owns only the K8s side — Services, Endpoints, IngressRoutes, middlewares. main.tf comment block updated to point at the new repo and the full DevVM port map. Removed: - stacks/terminal/files/ (index.html + DevVM artefacts) - stacks/terminal/tmux-api/ (Go service) - stacks/terminal/clipboard-upload/ (Go service)	2026-05-22 14:16:46 +00:00
Viktor Barzin	c135c04c79	terminal: make slate the default theme	2026-05-22 14:16:46 +00:00
Viktor Barzin	a44aa52e1a	terminal: theme picker (carbon/slate/mono/ink) replacing violet Drops the hardcoded violet/indigo palette. Four themes are defined as CSS variables on body.theme-{carbon,slate,mono,ink}: - Carbon (default): warm dark, ivory text, restrained amber accent. - Slate: cool dark, GitHub/Linear-ish charcoal with electric blue. - Mono: strict greyscale, off-white accent. - Ink: warm paper light, deep ink, terracotta accent. The lobby reads the choice from localStorage and applies the class before render. The picker lives at the bottom of the sidebar (margin-top: auto pins it). On change, the iframe is bounced through about:blank so the inner xterm picks up the new computed CSS vars (--terminal-bg/fg/cursor/selection) on the next mount. Picker UI uses native buttons, current theme highlighted with the accent border + color. No gradients, hairline borders only.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbe83597c0	terminal: rename sessions + drag-and-drop reorder Backend: POST /sessions/<name>/rename in tmux-api runs tmux rename-session as the mapped OS user. 400 on bad name, 404 on missing source, 409 on duplicate target, 401 on missing auth header. Frontend: - Rename button per card → prompt() dialog, validates against the shared regex. Updates currentActive + hash + iframe.src if the renamed session was active. - Session order is now user-driven, persisted in localStorage keyed per osUser. New sessions append at the bottom. The previous sort-by-lastActivity is gone. - HTML5 drag-and-drop reorders cards live during dragover; dragend captures the DOM order into localStorage. - Polling renderLobby is suppressed while a drag is in flight so the 5s tick doesn't yank the list out from under the user.	2026-05-22 14:16:45 +00:00
Viktor Barzin	04fd241679	terminal: inline session switching via sidebar + iframe Replace full-page navigation with a two-pane lobby. Sidebar holds the session list as clickable cards; an iframe in the content pane swaps its src on click so switching sessions takes one click instead of two navigations. - #lobby-shell grid (260px sidebar + iframe pane) - Cards become role=button, kill button stops propagation - activateSession/deactivateSession with hash routing (location.hash <-> active session, replaceState so back stack stays clean) - Killed active session deactivates the iframe before re-render - 5s session poll preserves currentActive; deactivates if gone - Mobile media query collapses to one column CSP frame-ancestors already permits same-origin embedding (*.viktorbarzin.me), no infra changes needed. Direct-link ?arg=<name> path is unchanged.	2026-05-22 14:16:45 +00:00
root	7663b5c36e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	43affc3cdc	actualbudget: add `enabled` flag to factory, disable emo Emo isn't using the instance and the daily bank-sync CronJob has been failing because the budget has zero accounts (deleted from the UI), triggering BankSyncStale. Adds an `enabled` toggle that gates the core Deployment + Service + Ingress + http-api + CronJob behind a single plan-time bool while preserving the PVC, so we can flip back to true later to restore the instance as-was. Also fixes a latent bug where the http-api Service was always created even when `enable_http_api=false`. Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks migrated their state cleanly to the new [0] addresses). Pushgateway job bank-sync-emo cleared manually; orphaned external-monitor synced out by external-monitor-sync.	2026-05-22 14:16:45 +00:00
Viktor Barzin	9fce3c7b09	terminal: per-Authentik-user OS-user isolation; deny unmapped users Restores the kernel-level isolation the pre-cutover ttyd-session.sh had, but keeps the multi-session lobby UX: - ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the connection (no fallback to wizard) if there's no mapping, otherwise `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own Unix user → its own `/tmp/tmux-<uid>/default` socket. - tmux-api scopes every request to the same OS user via the same header. Adds /whoami so the lobby HTML can preflight access and render "logged in as <os_user> (<authentik>)" instead of leaving the user to discover the deny via a reconnect loop. - Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users fragment under files/devvm/ so future operators see one canonical source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo. Adding a user is now: append a line to ttyd-user-map + a NOPASSWD sudoers line + `useradd -m`. README walks through it. No Terraform changes — this is all DevVM-side + lobby JS.	2026-05-22 14:16:45 +00:00
Viktor Barzin	aff4f67671	terminal: cut over to multi-session lobby on terminal.viktorbarzin.me Promotes the staged multi-session UX from term.viktorbarzin.me to the primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM moves to the same ExecStart that `ttyd-multi.service` was running: `/usr/local/bin/ttyd -W -a -t enableClipboard=true -I /usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`. The lobby HTML supersedes the old per-user-attach index.html (ttyd-session.sh wrapper retired alongside). Terraform: retires the `terminal-multi` Service+Endpoints and the term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is released by module deletion). The tmux-api Service+Endpoints stay, but its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix specificity wins against the catch-all ingress. DevVM follow-up (applied manually as before — see files/devvm/README.md): restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.	2026-05-22 14:16:45 +00:00
root	86a2c66c8e	Woodpecker CI deploy [CI SKIP]	2026-05-22 14:16:45 +00:00
Viktor Barzin	b1b2cb1974	terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive) New hostname term.viktorbarzin.me serves a session-picker UI that lists, creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that session (auto-creates via tmux -A). Builds on a fresh ttyd instance (7685) plus a tmux-api Go binary (7684) on the DevVM, both running as User=wizard alongside (not replacing) the existing ttyd.service (7681), ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of terminal.viktorbarzin.me to the multi-session setup is deferred. Terraform diff is purely additive — terminal-multi/tmux-api Service + Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an IngressRoute that path-prefixes /api/sessions/* to tmux-api with the matching strip-prefix Middleware. DevVM-side units ship under files/devvm/ with a README — manual scp + systemctl install (see files/devvm/README.md). ttyd 1.7.7 already deployed there (≥1.7 needed for -a). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
Viktor Barzin	726fb25182	monitoring(wealth): paint declining segments red on growth chart Mirror the panel 5 treatment on panel 7 (Growth = market value − contribution). Second SQL column emits the growth value only when the point is part of a declining segment; field override paints it red with no fill, spanNulls=false.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cc47da87b0	payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs Identified during alert-noise review as steady sources of JobFailed. Suspending them stops the noise; unsuspend after the per-job blocker is cleared. * payslip-ingest/actualbudget-payroll-sync — blocked on Vault `secret/payslip-ingest` missing `actualbudget_encryption_password`. `actualbudget_api_key` and `actualbudget_budget_sync_id` were added (copied from `secret/fire-planner`) in the same session; the encryption password is not stored anywhere in Vault and needs to be populated separately. ExternalSecret sync has been failing since 2026-04-25. * instagram-poster/ig-refresh-token — the deployed image (:da5b4191) does not contain the `POST /ig-refresh-token` route; the route is defined in uncommitted working-copy changes at `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the new image rolls. Each `suspend = true` line carries an inline comment with the unsuspend trigger.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbd0f71a3b	monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h Three improvements identified in the 7d alert-noise review: A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures node-level pull error rate, which doesn't catch a single pod stuck in ImagePullBackOff — council-complaints sat broken for ~10h on 2026-05-12 without paging. The new rule fires per-pod after 30m. B. Two new inhibit_rules: - PVFillingUp (95% used, critical) suppresses PVPredictedFull (linear projection, warning) on the same PVC. Pair was producing ~24h of redundant firing per 7d. - EmailRoundtripFailing (active probe failure) suppresses EmailRoundtripStale (derivative >60min no-success). Same outage windows, ~14.5h of duplicate firing per 7d. C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old 30-minute window paged on the first failed iteration before the next run could recover. 2h means "still failing across at least two cron iterations" — much more actionable. Verified live: rules loaded, inhibitors in alertmanager config, PodImagePullBackOff is currently inactive (council-complaints ImagePullBackOff actively detected — see separate fix).	2026-05-22 14:16:45 +00:00
Viktor Barzin	70292b9e23	monitoring: TraefikReplicaConfigStale — drop false-positive on stale series The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made a recently-deleted pod's lingering rate=0 drive the ratio toward infinity for up to 2h until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path). Tighter formulation: * 30m rate window — stale series ages out within minutes, not hours * `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip * for: 1h — fast enough to catch the next incident, long enough that short rate dips don't flap Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0 results with the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).	2026-05-22 14:16:45 +00:00
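
The tightened condition in the commit above, `(max/min) > 5 AND min > 0.0005`, can be sketched outside PromQL as a plain predicate. This is a minimal illustration only, assuming the per-pod reload rates are already available as floats. The function name `replica_config_stale` and its parameter names are hypothetical; the real rule lives in the Prometheus alert definition, and the `for: 1h` hold is not modelled here.

```python
def replica_config_stale(rates, ratio_threshold=5.0, min_floor=0.0005):
    """Return True when the spread of per-pod reload rates looks stale.

    rates: per-pod reload rates (events/second), assumed to be computed
    elsewhere over the 30m window described in the commit message.
    """
    rates = list(rates)
    if len(rates) < 2:
        return False  # a single replica has no peer to compare against
    lowest, highest = min(rates), max(rates)
    # The floor filters stale-zero and fresh-pod ramp-up series.
    if lowest <= min_floor:
        return False
    return highest / lowest > ratio_threshold
```

With a tight spread such as the one reported for the live cluster (roughly 0.00065 to 0.0007/s), the predicate stays false. A lingering zero-rate series is rejected by the floor before the ratio is ever computed, which is the false positive the commit set out to remove.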
Viktor Barzin	165bb7258e	monitoring: detect stale Traefik replicas + reduce alert-storm cascading Two new alertmanager inhibit rules and one new Prometheus alert, informed by the 2026-05-12 incident where Traefik pod traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers vs 119 on healthy peers (stale K8s informer cache) and served 404 for ~1/3 of viktorbarzin.me traffic. * New alert TraefikReplicaConfigStale: fires when max/min reload-rate ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h for-clause tolerates legitimate post-restart ramp-up; the bug pattern persists indefinitely. * New inhibit: TraefikReplicaConfigStale suppresses the symptom alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical}, IngressErrorRate5xxHigh, TraefikHighOpenConnections, ForwardAuthFallbackActive, AnubisChallengeStoreErrors, ExternalAccessDivergence) so only the actionable root cause pages. * New inhibit: HomeAssistantDown suppresses HomeAssistantCriticalSensorUnavailable and HomeAssistantMetricsMissing — when HA itself is down, every sensor going unavailable is noise (10x firings observed in the last 12h). * Extend NodeDown and NFSServerUnresponsive target lists to also suppress HomeAssistantCriticalSensorUnavailable.	2026-05-22 14:16:45 +00:00
Viktor Barzin	448bc0c0f6	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
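
The deterministic naming scheme `k8s-upgrade-<phase>-<target_version>[-<node>]` from the commit above can be sketched as a small helper. This is an assumption-laden illustration, not the repository's code: the function name `upgrade_job_name` is hypothetical, the version string format in the examples is a guess, and any sanitising needed to satisfy Kubernetes object-name rules is left out.

```python
def upgrade_job_name(phase, target_version, node=None):
    """Build a Job name that depends only on its inputs.

    Because the same (phase, target_version, node) always yields the same
    name, re-applying a manifest targets the existing Job instead of
    creating a duplicate.
    """
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        # Per-node phases (for example worker drains) append the node name.
        parts.append(node)
    return "-".join(parts)
```

Under these assumptions, a preflight Job and a per-node worker Job for the same target version get distinct but reproducible names, so re-creating a failed step does not fan out into duplicate downstream Jobs.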

3274 commits