Commit graph

3271 commits

Author SHA1 Message Date
Viktor Barzin
2e52583abd Phase 1a: enroll 4 self-hosted services in Keel auto-update
Enrolls the cleanest Woodpecker-build-only self-hosted services into
the inject-keel-annotations ClusterPolicy by labeling their namespaces
keel.sh/enrolled=true. CI already pushes :latest (auto_tag: true) on
each, so Keel will detect the current upstream digest and trigger a
rolling restart when polling starts (1h cadence).

Per-Deployment lifecycle extended with KYVERNO_LIFECYCLE_V2 to suppress
the annotation drift Kyverno will inject (keel.sh/policy, /trigger,
/pollSchedule).

Services included:
  - fire-planner
  - job-hunter
  - payslip-ingest
  - recruiter-responder

Skipped from Phase 1 for follow-up:
  - claude-agent-service (user has WIP on main.tf)
  - claude-memory (Postgres co-deployed; treat in Phase 9 with other DBs)
  - kms (two Deployments; needs per-resource review)
  - wealthfolio (sync sidecar pattern; needs review)
  - chrome-service (deliberate :v4 pin; needs keel.sh/policy: never label)
  - GHA-migrated repos (10) (need per-repo CI cleanup)
  - beadboard, freedify (no CI)

See docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
5acfab5bb9 recruiter-responder: bump image_tag to f3cb91ff (180d research_cache TTL)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
e5a65c11a9 recruiter-triage v3: Perks & Office Life section + cache-first deep_research
- claude-agent-service bumped to f764fef6 (agent system prompt adds
  the Perks block: food/health/pension/equity/PTO/parental/equipment/
  learning/wellness/amenities/commuter). 1200-word cap.
- recruiter-responder bumped to 38a2cdaa (cache-first deep_research:
  serves cached payload if fetched_at + ttl_seconds > now; cache
  writes upsert; new force flag bypasses).

Verified end-to-end: deep_research on Datadog now returns full Perks
section (~220s, $0.60, 23 turns). Earlier 500 fixed (was
uq_research_company_tier dup-key on re-run).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
020f62555b Phase 0: install Keel + Kyverno auto-update annotation injector
Foundation for opt-out-pure auto-update model per
docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md.

- New stack `stacks/keel/` deploys Keel via Helm (charts.keel.sh, v1.0.6).
  Polls registries hourly per design decision #8. Default schedule
  overridable per-workload via keel.sh/pollSchedule annotation.
- New Kyverno ClusterPolicy `inject-keel-annotations` mutates Deployments,
  StatefulSets, and DaemonSets in namespaces labeled `keel.sh/enrolled=true`
  with keel.sh/policy=force + trigger=poll + pollSchedule=@every 1h.
- Phase 0 enrolls no namespaces. Phase 1 (next session) labels the
  self-hosted set.
- Per-workload opt-out: label `keel.sh/policy: never` (used by rollback
  runbook and chrome-service-style deliberate pins).
- Keel namespace excluded from the mutate — supervisor self-update has
  too-bad a failure mode (decision #11).
- AGENTS.md: KYVERNO_LIFECYCLE_V2 marker convention added for the
  ignore_changes block enrolled workloads need.
- .claude/CLAUDE.md: docker-images rule flagged as transitional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
9476649539 docs/pm: kured silently stalled 6 days + Anubis HA lift (2026-05-16)
Captures the May 10–16 kured-vs-sentinel-gate hostPath mismatch (chart
derived hostPath from configuration.rebootSentinel) and the companion
work to harden the rolling-reboot pipeline against single-replica
PDB deadlocks: Anubis 1→2 replicas with shared Valkey store, kured
drainTimeout=30m, CNPG pg-cluster 2→3 instances. Includes the
mysql-standalone-PDB orphan cleanup and the k8s-node1 containerd-source
drift audit (benign).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
3ef860b2be kured + cnpg: drain-safe defaults ahead of Monday reboot wave
Three defensive moves to make the kured rolling-reboot cycle survive
edge cases without operator intervention:

kured (stacks/kured/main.tf):
  - Set `configuration.drainTimeout = "30m"`. Default is unlimited; if
    a future PDB or finalizer stalls drain, kured retries forever and
    the node stays cordoned silently. 30m caps the silent-failure
    window — after timeout kured logs the abort and waits for the
    next period; the node stays Schedulable so cluster capacity isn't
    lost. Lets us fail closed instead of fail-silent.

CNPG pg-cluster (stacks/dbaas/modules/dbaas/main.tf):
  - Bump instances 2 → 3 (1 primary + 2 replicas). With 2 instances the
    failover during a primary-node drain depended on the lone replica
    being caught up; a WAL backlog would stall the drain until the
    replica was current. With 3 instances CNPG always has at least one
    fully-current replica to promote, and the PDB's
    `minAvailable=1` on the primary selector is satisfied throughout
    the switchover. Storage: +20Gi PVC on proxmox-lvm-encrypted (about
    35Gi after autoresize). Memory: +3Gi pod limit.
  - Updated the `triggers.instances` so the null_resource's local-exec
    actually re-applies the YAML (kubectl apply with the new spec). The
    YAML is the source-of-truth but the trigger is what tells terraform
    to re-run the provisioner.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:48 +00:00
Viktor Barzin
4ff3638065 state(dbaas): update encrypted state 2026-05-22 14:16:48 +00:00
Viktor Barzin
08bf5e47b7 state(dbaas): update encrypted state 2026-05-22 14:16:48 +00:00
Viktor Barzin
5768216d0e anubis: HA with shared valkey/redis store + replicas=2
Anubis pre-2026-05-16 ran at replicas=1 because in-flight PoW challenge
state lived in process memory — a challenge issued by pod A wouldn't be
verifiable by pod B (HTTP 500 "store: key not found"). The PDB at
`minAvailable=1` made this worse: with replicas=1 the eviction API can
NEVER satisfy the constraint, so every drain on a node hosting an Anubis
pod looped forever. This is what stalled the manual K8s upgrade on
2026-05-11 (had to delete pods directly to bypass eviction) and was
about to block kured on Monday 2026-05-18 once the kured sentinel fix
landed.

Anubis upstream has first-class support for a Valkey/Redis-protocol
shared store (documented as the "Kubernetes worker pool" pattern).
Wire it up:

- modules/kubernetes/anubis_instance: add `shared_store_url` variable.
  When set, appends a `store: { backend: valkey, parameters: { url } }`
  block to the rendered policy YAML and defaults replicas to 2 (capped
  at 2). PDB switched from `minAvailable=1` to `maxUnavailable=1` so
  drains can take down one pod at a time. topologySpreadConstraint
  tightened to `DoNotSchedule` so the two replicas land on different
  nodes — a single node loss never takes a whole Anubis instance down.
- All 8 call sites (cyberchef, jsoncrack, kms, homepage, blog,
  travel_blog, real-estate-crawler, f1-stream) opted in. Each picks a
  unique Redis DB index (5–12) on `redis-master.redis:6379`. Cluster
  Redis already runs HA via Sentinel + haproxy, no new infra needed.

Verified: every Anubis Deployment now 2/2 Ready with pods on different
nodes; PDBs allow 1 disruption; Redis DBs 5,7,8,10 already populated
by live traffic post-apply; Palo Alto Networks scanner hit blog right
after apply and the challenge log shows the new state path.

Drain on any worker now succeeds without a `predrain_unstick` workaround
— eviction API is satisfied because at most one pod is unavailable at a
time, and the other replica keeps serving. Monday's kured reboot wave
should roll through cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
3025879478 claude-agent-service: ship recruiter-triage agent + restore missing terragrunt.hcl
- main.tf: bump image_tag to 1b3350c0 (carries the new agent),
  init container  also copies recruiter-triage.md
  into /home/agent/.claude/agents/.
- terragrunt.hcl: restored (file was missing — apply was blocked).
  Standard root include + platform/vault/external-secrets dependencies.

Smoke-tested 2026-05-16: deep_research call on Datadog (thread 42)
via recruiter-responder REST API → 102.5s, $0.43, structured
markdown report with comp bands vs £600k floor, culture signals,
remote policy, recent news, sources cited. End-to-end Tier-2 is live.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
ea2342b8e2 docs: add CONTEXT.md domain glossary [ci skip]
Adds the per-repo domain glossary that engineering skills
(diagnose, tdd, improve-codebase-architecture, grill-with-docs)
read before working in this repo. Terms only — no implementation
detail. Six clusters (code organization, cluster, networking,
storage, secrets, CI/CD), 22 terms, plus relationships, an example
dialogue, and five flagged ambiguities.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
ce5f3ec209 recruiter-responder: expose Gmail IMAP creds for backtest CLI
Pulls vbarzin@gmail.com app password from secret/recruiter-responder
(seeded from secret/wealthfolio.imap_password — same Gmail credential
that wealthfolio uses for broker-statement ingestion). Env vars
GMAIL_IMAP_USER + GMAIL_IMAP_PASS, consumed by 'backtest gmail'.

Backtest verified 2026-05-16 against folder
'companies-I-dont-take-seriously': 20/20 recruiter, 100% company
extraction (9 stated, 6 subject, 4 sender_domain, 1 body), 30% comp,
avg 12s latency.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
065982d978 kured: fix sentinel path mismatch that stalled rolling reboots
The kured Helm chart derives the sentinel hostPath from
`dirname(configuration.rebootSentinel)`. Previously
rebootSentinel=/sentinel/gated-reboot-required pointed hostPath at
`/sentinel/` (an empty auto-created directory on every host) while the
kured-sentinel-gate DaemonSet writes to /var/run/gated-reboot-required.

Two different host directories → kured never saw the open gate, even
though the gate's checks were all green every 5 min on every node.
Result: unattended-upgrades has packages waiting on every node since
2026-05-10 (when uu was re-enabled) and kured's hourly log says
"Reboot not required" for the entire period.

Set rebootSentinel=/var/run/gated-reboot-required so the chart mounts
hostPath /var/run — same directory the gate writes to. The in-pod
mountPath (/sentinel) is hardcoded by the chart and doesn't matter,
the symlink chain works out: /sentinel/<file> inside the pod resolves
to /var/run/<file> on the host.

Verified: kured pod can now list /sentinel/gated-reboot-required
(0 B) AND /sentinel/reboot-required (32 B, set by uu on 2026-05-15).
First gated reboot will land Mon 2026-05-18 02:00 London.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
80e6314bf0 recruiter-responder: bump image_tag to 559e5c57
PDF extraction, tech_stack list, aggressive company/comp inference,
no-phone-call drafts, backtest CLI.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
8e11caff8d recruiter-responder: bump image_tag to bbd178da (structured Telegram + comp floor)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
391c002f9a service-catalog: add aiostreams entry
Stremio stream aggregator now has its own row in the Active Use tier.
Captures the auth model (own UUID+password, not Authentik), monitoring
posture (canary probe + 3 alerts), and backup pipeline (weekly NFS
dumps of both decrypted config and the Stremio account addon
collection).

Follow-up from the 2026-05-15/16 hardening session: 5 commits on
servarr/aiostreams, none previously catalogued.
2026-05-22 14:16:47 +00:00
Viktor Barzin
24ce3e267d aiostreams: weekly backup of Stremio account addon collection
Adds stremio-account-backup CronJob (Sun 04:00 weekly, offset 1h from
the AIOStreams config-backup at 03:00):

- Logs into api.strem.io with credentials from Vault
  (secret/viktor.stremio_email + stremio_password, now also synced
  into the aiostreams-probe-secrets ExternalSecret)
- Fetches the full addonCollection via addonCollectionGet
- Writes timestamped JSON to the existing aiostreams-backup PVC
  (NFS /srv/nfs/aiostreams-backup/stremio-collection-*.json, mode 0600)
- 90-day retention, logs out to invalidate the auth key
- Pushgateway metrics: stremio_account_backup_{success,bytes,
  addon_count,duration_seconds,last_run_timestamp}

Protects against: accidental "uninstall all" / API regression / wrong
account login wiping the curated set of 22 addons (Cinemeta + 16
MDBList + AIOStreams + More Like This + Formulio + Zamunda + Local).

Verified: manual run wrote 93480 bytes, 22 addons, file present on NFS.
2026-05-22 14:16:47 +00:00
aa6e9b0242 recruiter-responder: public /cb ingress for Telegram URL-button callbacks
- Add ingress_factory module (auth=none, HMAC + expiry are the gate);
  ingress_path=["/cb"] only — /api stays internal, /healthz cluster.
  dns_type=proxied. anti_ai_scraping=false.
- Drop setup_tls_secret module — Kyverno ClusterPolicy `sync-tls-secret`
  auto-clones the wildcard cert into every namespace.
- Bump image_tag to 7383b426 (callback endpoints + SMTP STARTTLS
  hostname relax).
- Wire CALLBACK_BASE_URL=https://recruiter-responder.viktorbarzin.me.
- Drop git-crypt-encrypted wildcard cert files into
  stacks/recruiter-responder/secrets/. Allowlist privkey.pem in a new
  .gitleaksignore — git-crypt encrypts at rest but the working-tree
  copy is plaintext, so gitleaks can't tell.

Smoke-tested end-to-end 2026-05-15 23:45:
  synthetic email -> Telegram with / buttons ->  tapped via curl
  -> 'Sent' HTML page -> thread.status=sent, decision row recorded
  with decided_via=telegram_button, outbound message threaded correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
Viktor Barzin
77010b769a aiostreams: whitelist Vidhin + Tamtaro sync URLs
Adds two env vars on the AIOStreams deployment:
- WHITELISTED_REGEX_PATTERNS_URLS: Vidhin's release-group regex
  (TRaSH-aligned) so syncedRankedRegexUrls works for the user
- WHITELISTED_SEL_URLS: Vidhin's ranked stream expressions +
  Tamtaro's ISE/PSE/ESE-standard

Gotcha: AIOStreams validates each synced* field against the matching
whitelist — stream-expression files (incl. Vidhin's expressions.json)
go in WHITELISTED_SEL_URLS, not the regex one, even though they live
in Vidhin's regex repo. Mixing them up returns USER_INVALID_CONFIG.

User config: enabled Vidhin's regex + ranked expressions + Tamtaro's
ISEs. Skipped Tamtaro PSE/ESE for now to avoid surprise over-filtering;
can be added later from the same whitelist.
2026-05-22 14:16:47 +00:00
Viktor Barzin
c396092c86 aiostreams: weekly NFS backup of decrypted user config
Adds aiostreams-config-backup CronJob (Sun 03:00 weekly):
- Pulls /api/v1/user via internal ClusterIP with UUID + password from
  the existing aiostreams-probe-secrets ExternalSecret
- Writes timestamped JSON to nfs-backup PVC mounted at /backup
- 90-day retention, prunes older files
- Pushgateway metrics: aiostreams_config_backup_{success,bytes,duration,last_run_timestamp}

NFS path: 192.168.1.127:/srv/nfs/aiostreams-backup (auto-synced offsite
to Synology via the existing offsite-sync-backup CronJob).

Complements the daily postgresql-backup-per-db pipeline (which dumps
the encrypted blob) by storing the decrypted JSON — usable for human
inspection / disaster recovery even without the AIOStreams password.

Verified: manual job wrote 12931 bytes, file present on NFS.
2026-05-22 14:16:47 +00:00
root
1177a82452 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:47 +00:00
a98b00324d recruiter-responder: pin image tag + run plugin installer init as root
- stacks/recruiter-responder/terragrunt.hcl: bump image_tag to 0500c3d3
  (300s LLM timeouts + IMAP BODY.PEEK[] fix).
- stacks/openclaw/main.tf: install-recruiter-plugin init container now
  runs as uid 0 — the openclaw NFS volume is owned by uid 1000 and the
  recruiter-responder image otherwise drops to uid 10001 which can't
  write or chown.

Smoke-tested end-to-end 2026-05-15 ~23:15:
  Synthetic recruiter email -> IMAP IDLE EXISTS push -> qwen3-8b triage
  (12.1s, JSON output complete with company/role/salary/location/tech)
  -> 2 drafts persisted in Postgres -> Telegram sendMessage 200 OK.
Then deleted 3 stale n8n workflows W992Nr7..., 1AU4k7..., IisDNx... from
the n8n Postgres workflow_entity table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:47 +00:00
a72590db7d recruiter-responder: vault DB role + switch proactive push to Telegram
- stacks/vault/main.tf: register pg-recruiter-responder static role on
  the postgresql connection (7d password rotation). Adds the role to
  allowed_roles and creates vault_database_secret_backend_static_role
  for `recruiter_responder` user.
- stacks/recruiter-responder/main.tf: drop TASK_WEBHOOK_URL env, swap
  TASK_WEBHOOK_TOKEN secret for TELEGRAM_BOT_TOKEN + TELEGRAM_CHAT_ID.
  Updated header doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
89e9471e87 state(vault): update encrypted state 2026-05-22 14:16:46 +00:00
7e1580ba8c recruiter-responder: deploy stack + llama-cpp qwen3-8b + openclaw plugin mount
Three coupled changes for the new recruiter-responder pipeline:

1. stacks/llama-cpp/: add qwen3-8b text-only model to llama-swap. Uses
   unsloth/Qwen3-8B-GGUF Q4_K_M, 16k context, no mmproj. Refactored the
   download Job script + cmd renderer to handle text_only=true (skip
   mmproj download + --mmproj flag). The 3 existing vision models stay
   on text_only=false; no behaviour change for them.

2. stacks/recruiter-responder/: new stack. Namespace, 2 ExternalSecrets
   (app secrets from secret/recruiter-responder, DB creds from Vault DB
   engine static-creds/pg-recruiter-responder), Deployment (replicas=1,
   Recreate -- IMAP IDLE + APScheduler want single leader), Service
   ClusterIP. Image: forgejo.viktorbarzin.me/viktor/recruiter-responder.

3. stacks/openclaw/: add init container `install-recruiter-plugin` that
   uses the recruiter-responder image to copy the .mjs plugin into
   /home/node/.openclaw/extensions/recruiter-api/ on NFS. Couples plugin
   version to the recruiter-responder image tag. Also injects
   RECRUITER_RESPONDER_URL + RECRUITER_RESPONDER_TOKEN env vars (token
   from openclaw-secrets.recruiter_responder_bearer_token, optional).

Pre-apply checklist for recruiter-responder stack:
  - Vault: seed secret/recruiter-responder with webhook_bearer_token,
    imap_{me,spam}_{user,pass}, smtp_password, claude_agent_token,
    task_webhook_token.
  - Vault: add secret/openclaw.recruiter_responder_bearer_token (same as
    above webhook_bearer_token).
  - dbaas: create DB recruiter_responder + role recruiter_responder,
    and Vault DB-engine role static-creds/pg-recruiter-responder.
  - Build + push image via Woodpecker (recruiter-responder repo CI).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:46 +00:00
Viktor Barzin
95b9f7bc89 aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL repeat requests for the same title
   are instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
2026-05-22 14:16:46 +00:00
root
fba5ee2df4 Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:46 +00:00
Viktor Barzin
c73234982f aiostreams: pin nightly + switch to auth=app
- Pin viren070/aiostreams:nightly → :2026.05.14.1326-nightly (avoid
  stale-pull cache, matches 8-char SHA convention for rolling tags)
- Switch ingress auth tier required → app: Authentik forward-auth
  blocks Stremio clients (cannot follow OAuth 302), and AIOStreams
  already enforces UUID + password on /configure and /api/*, with
  Stremio addon URLs using encryptedPassword as a bearer token.
  Result: empty-stream-list issue fixed for public Stremio clients.

Verified: 410 streams returned via public URL for Breaking Bad S01E01
with no cookies, vs 0 before (502→Authentik OIDC redirect).
2026-05-22 14:16:46 +00:00
2903ab9778 monitoring(wealth): move Positions table under contrib/growth row
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
2026-05-22 14:16:46 +00:00
8461275308 wealth: positions table panel (shares + cost basis + unrealised return)
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).

Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
2026-05-22 14:16:46 +00:00
d6049ff7a0 terminal: extract app code to viktor/terminal-lobby on Forgejo
The lobby has grown enough (frontend, two Go services, devvm units +
scripts + config) that it earns its own repo. Code now lives at
https://forgejo.viktorbarzin.me/viktor/terminal-lobby with
scripts/deploy.sh covering the manual deploy until CI activation
lands (Woodpecker forge_id=2 activation still 500s; Forgejo Actions
not yet enabled).

This stack now owns only the K8s side — Services, Endpoints,
IngressRoutes, middlewares. main.tf comment block updated to point
at the new repo and the full DevVM port map.

Removed:
- stacks/terminal/files/        (index.html + DevVM artefacts)
- stacks/terminal/tmux-api/     (Go service)
- stacks/terminal/clipboard-upload/ (Go service)
2026-05-22 14:16:46 +00:00
c135c04c79 terminal: make slate the default theme 2026-05-22 14:16:46 +00:00
a44aa52e1a terminal: theme picker (carbon/slate/mono/ink) replacing violet
Drops the hardcoded violet/indigo palette. Four themes are defined as
CSS variables on body.theme-{carbon,slate,mono,ink}:

- Carbon (default): warm dark, ivory text, restrained amber accent.
- Slate: cool dark, GitHub/Linear-ish charcoal with electric blue.
- Mono: strict greyscale, off-white accent.
- Ink: warm paper light, deep ink, terracotta accent.

The lobby reads the choice from localStorage and applies the class
before render. The picker lives at the bottom of the sidebar
(margin-top: auto pins it). On change, the iframe is bounced through
about:blank so the inner xterm picks up the new computed CSS vars
(--terminal-bg/fg/cursor/selection) on the next mount.

Picker UI uses native buttons, current theme highlighted with the
accent border + color. No gradients, hairline borders only.
2026-05-22 14:16:45 +00:00
cbe83597c0 terminal: rename sessions + drag-and-drop reorder
Backend: POST /sessions/<name>/rename in tmux-api runs tmux
rename-session as the mapped OS user. 400 on bad name, 404 on missing
source, 409 on duplicate target, 401 on missing auth header.

Frontend:
- Rename button per card → prompt() dialog, validates against the
  shared regex. Updates currentActive + hash + iframe.src if the
  renamed session was active.
- Session order is now user-driven, persisted in localStorage
  keyed per osUser. New sessions append at the bottom. The previous
  sort-by-lastActivity is gone.
- HTML5 drag-and-drop reorders cards live during dragover; dragend
  captures the DOM order into localStorage.
- Polling renderLobby is suppressed while a drag is in flight so the
  5s tick doesn't yank the list out from under the user.
2026-05-22 14:16:45 +00:00
Viktor Barzin
04fd241679 terminal: inline session switching via sidebar + iframe
Replace full-page navigation with a two-pane lobby. Sidebar holds the
session list as clickable cards; an iframe in the content pane swaps
its src on click so switching sessions takes one click instead of two
navigations.

- #lobby-shell grid (260px sidebar + iframe pane)
- Cards become role=button, kill button stops propagation
- activateSession/deactivateSession with hash routing
  (location.hash <-> active session, replaceState so back stack stays
  clean)
- Killed active session deactivates the iframe before re-render
- 5s session poll preserves currentActive; deactivates if gone
- Mobile media query collapses to one column

CSP frame-ancestors already permits same-origin embedding
(*.viktorbarzin.me), no infra changes needed. Direct-link
?arg=<name> path is unchanged.
2026-05-22 14:16:45 +00:00
root
7663b5c36e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
43affc3cdc actualbudget: add enabled flag to factory, disable emo
Emo isn't using the instance and the daily bank-sync CronJob has been
failing because the budget has zero accounts (deleted from the UI),
triggering BankSyncStale. Adds an `enabled` toggle that gates the core
Deployment + Service + Ingress + http-api + CronJob behind a single
plan-time bool while preserving the PVC, so we can flip back to true
later to restore the instance as-was.

Also fixes a latent bug where the http-api Service was always created
even when `enable_http_api=false`.

Apply: 7 resources destroyed (emo deploy/svc/ingress/cf dns/http-api
deploy+svc/cronjob), 0 changes for viktor/anca (moved blocks
migrated their state cleanly to the new [0] addresses). Pushgateway
job bank-sync-emo cleared manually; orphaned external-monitor
synced out by external-monitor-sync.
2026-05-22 14:16:45 +00:00
9fce3c7b09 terminal: per-Authentik-user OS-user isolation; deny unmapped users
Restores the kernel-level isolation the pre-cutover ttyd-session.sh had,
but keeps the multi-session lobby UX:

- ttyd.service gets `-H X-authentik-username` back. `tmux-attach.sh` reads
  $TTYD_USER, looks up the local part in /etc/ttyd-user-map, denies the
  connection (no fallback to wizard) if there's no mapping, otherwise
  `sudo -n -H -u <os_user> tmux …`. Each Authentik identity → its own
  Unix user → its own `/tmp/tmux-<uid>/default` socket.
- tmux-api scopes every request to the same OS user via the same header.
  Adds /whoami so the lobby HTML can preflight access and render
  "logged in as <os_user> (<authentik>)" instead of leaving the user to
  discover the deny via a reconnect loop.
- Commits /etc/ttyd-user-map and the matching /etc/sudoers.d/ttyd-users
  fragment under files/devvm/ so future operators see one canonical
  source of truth. Current mappings: vbarzin → wizard, emil.barzin → emo.

Adding a user is now: append a line to ttyd-user-map + a NOPASSWD
sudoers line + `useradd -m`. README walks through it.

No Terraform changes — this is all DevVM-side + lobby JS.
2026-05-22 14:16:45 +00:00
aff4f67671 terminal: cut over to multi-session lobby on terminal.viktorbarzin.me
Promotes the staged multi-session UX from term.viktorbarzin.me to the
primary terminal.viktorbarzin.me hostname. `ttyd.service` on the DevVM
moves to the same ExecStart that `ttyd-multi.service` was running:
`/usr/local/bin/ttyd -W -a -t enableClipboard=true -I
/usr/local/share/ttyd/index.html -p 7681 /usr/local/bin/tmux-attach.sh`.
The lobby HTML supersedes the old per-user-attach index.html
(ttyd-session.sh wrapper retired alongside).

Terraform: retires the `terminal-multi` Service+Endpoints and the
term.viktorbarzin.me ingress (Cloudflare DNS record for `term` is
released by module deletion). The tmux-api Service+Endpoints stay, but
its IngressRoute now matches terminal.viktorbarzin.me — same path-prefix
specificity wins against the catch-all ingress.

DevVM follow-up (applied manually as before — see files/devvm/README.md):
restart ttyd to pick up the new unit, stop+disable ttyd-multi.service.
2026-05-22 14:16:45 +00:00
root
86a2c66c8e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
b1b2cb1974 terminal: add multi-tmux-session lobby on term.viktorbarzin.me (additive)
New hostname term.viktorbarzin.me serves a session-picker UI that lists,
creates, and kills tmux sessions. Visiting ?arg=<name> attaches to that
session (auto-creates via tmux -A). Builds on a fresh ttyd instance
(7685) plus a tmux-api Go binary (7684) on the DevVM, both running as
User=wizard alongside (not replacing) the existing ttyd.service (7681),
ttyd-ro.service (7682), and clipboard-upload (7683). Cutover of
terminal.viktorbarzin.me to the multi-session setup is deferred.

Terraform diff is purely additive — terminal-multi/tmux-api Service +
Endpoints + ingress_multi (term.viktorbarzin.me, Authentik-gated) + an
IngressRoute that path-prefixes /api/sessions/* to tmux-api with the
matching strip-prefix Middleware.

DevVM-side units ship under files/devvm/ with a README — manual scp +
systemctl install (see files/devvm/README.md). ttyd 1.7.7 already
deployed there (≥1.7 needed for -a).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
726fb25182 monitoring(wealth): paint declining segments red on growth chart
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cc47da87b0 payslip-ingest, instagram-poster: suspend two chronic-failure cronjobs
Identified during alert-noise review as steady sources of JobFailed.
Suspending them stops the noise; unsuspend after the per-job blocker is
cleared.

* payslip-ingest/actualbudget-payroll-sync — blocked on Vault
  `secret/payslip-ingest` missing `actualbudget_encryption_password`.
  `actualbudget_api_key` and `actualbudget_budget_sync_id` were added
  (copied from `secret/fire-planner`) in the same session; the
  encryption password is not stored anywhere in Vault and needs to be
  populated separately. ExternalSecret sync has been failing since
  2026-04-25.

* instagram-poster/ig-refresh-token — the deployed image (:da5b4191)
  does not contain the `POST /ig-refresh-token` route; the route is
  defined in uncommitted working-copy changes at
  `instagram-poster/instagram_poster/app.py:695`. Unsuspend after the
  new image rolls.

Each `suspend = true` line carries an inline comment with the unsuspend
trigger.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cbd0f71a3b monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h
Three improvements identified in the 7d alert-noise review:

A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
   node-level pull error rate, which doesn't catch a single pod stuck
   in ImagePullBackOff — council-complaints sat broken for ~10h on
   2026-05-12 without paging. The new rule fires per-pod after 30m.

B. Two new inhibit_rules:
   - PVFillingUp (95% used, critical) suppresses PVPredictedFull
     (linear projection, warning) on the same PVC. Pair was producing
     ~24h of redundant firing per 7d.
   - EmailRoundtripFailing (active probe failure) suppresses
     EmailRoundtripStale (derivative >60min no-success). Same outage
     windows, ~14.5h of duplicate firing per 7d.

C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
   30-minute window paged on the first failed iteration before the
   next run could recover. 2h means "still failing across at least
   two cron iterations" — much more actionable.

Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
2026-05-22 14:16:45 +00:00
Viktor Barzin
70292b9e23 monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
2026-05-22 14:16:45 +00:00
Viktor Barzin
165bb7258e monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-22 14:16:45 +00:00
Viktor Barzin
448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 4 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
root
8e13f1528e Woodpecker CI deploy [CI SKIP] 2026-05-22 14:16:45 +00:00
Viktor Barzin
e8854f9230 wealthfolio, paperless-ngx: drop migration-leftover -proxmox PVCs
The 2026-04-13 encrypted-PVC migration replaced the wealthfolio and
paperless-ngx data volumes with -encrypted variants but never removed
the original -proxmox PVC blocks from TF — both were sitting orphaned
with no pod mounting them, occupying 1Gi each of LVM thin pool. The
autoresizer also logged repeated "failed to get volume stats" for them
(no kubelet stats without a mounted pod), masking real signal.

  * wealthfolio: removed kubernetes_persistent_volume_claim.data_proxmox
  * paperless-ngx: removed kubernetes_persistent_volume_claim.data_proxmox
  (the paperless PVC turned out to be out-of-TF-state, so deleted via
   kubectl after the TF block removal.)
2026-05-22 14:16:45 +00:00
Viktor Barzin
701b0e3c57 claude-agent: replace unused 10Gi PVC with 5Gi NFS-backed /persistent
The 10Gi proxmox-lvm-encrypted PVC `claude-agent-workspace-encrypted` was
declared in TF but never wired into the deployment — the `workspace`
volume_mount pointed at an emptyDir, so the PVC sat allocated and idle
from 2026-04-15 to 2026-05-11.

Restructured per the design intent:
  * `workspace` (emptyDir) — fast per-pod ephemeral scratch for git clones.
    Each agent job clones the infra repo fresh, so persistence doesn't
    buy anything and emptyDir avoids RWO contention if the deployment
    is ever scaled past 1 replica.
  * `persistent` (5Gi NFS-backed RWX) — mounted at /persistent for cases
    where the agent needs to write state that should survive pod
    restarts (caches, ad-hoc outputs). RWX so all replicas share it;
    the service's sequential-mutex lock prevents concurrent writes.

Also fixed `fix-perms` init container: the Dockerfile's `WORKDIR
/workspace/infra` causes kubelet to create that path inside the
emptyDir as root:fsGroup with the setgid bit, which uid 1000 can't
write to. Pre-create the path + chmod 0775 to make it writable.

NFS export already exists on the PVE host
(/srv/nfs/claude-agent-persistent, owned 1000:1000).

Verified: pod runs 1/1; `/persistent` writable as agent uid 1000;
git-init successfully clones infra into /workspace/infra.
2026-05-22 14:16:45 +00:00