Commit graph

1613 commits

Author SHA1 Message Date
Viktor Barzin
28984dda9a monitoring/wealth: add per-year effective hourly-rate panel (gross vs net)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted to see, on the wealth dashboard, the hourly wage he earned
each year - both gross and net - with year on the X axis.

New timeseries (line) panel "Effective hourly rate - gross vs net":
- hourly = annual pay / hours worked; hours = contractual 40h/week
  (2,080h per full year, confirmed from the Facebook/Meta UK offer letter:
  Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually
  worked so partial years (2019, 2020, 2026) read correctly.
- Gross = gross_pay incl. notional RSU vest; Net = take-home.
- timeFrom 10y so all years show under the dashboard's default 180d range.

Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload
of doc 33) was removed separately, so 2023 is no longer double-counted; this
also corrects the existing net-pay panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:28:46 +00:00
Viktor Barzin
82371d1ef8 dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.

Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:47:09 +00:00
Viktor Barzin
82c9e69b77 dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt).
Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live
attribution found ~60% of its writes were nextcloud webcal calendar churn
(throttled separately at the app layer); this addresses write amplification
on the remainder:

- innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi
  hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free
  page -> constant flush-to-make-room write IOPS).
- container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already
  at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the
  headroom. One-time MySQL pod restart to apply.
- ignore_changes on the StatefulSet volume_claim_template: the VCT is
  immutable post-creation and pvc-autoresizer rewrites its annotations on
  the live object, so TF's desired VCT could never apply and errored every
  broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the
  long-standing need to -target around it.

Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy,
24 DBs reachable, restart clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:55:18 +00:00
469cdd7507 frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
HA live video from the cluster Frigate hangs/fails because the only path
to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203),
which cannot carry RTSP or WebRTC. The container already listens on
8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was
never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB
IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 +
WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN
endpoint for native HLS/WebRTC live (parity with the Hikvision NVR).
Companion non-Terraform steps are in the PR body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:15:22 +00:00
Viktor Barzin
9ea9cae073 rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:59:41 +00:00
Viktor Barzin
7cc9cde5b1 external-secrets: enable ESO Vault token cache to cut sdc write churn
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Add --enable-vault-token-cache to the ESO controller (a graduated,
non-experimental flag in chart 2.6.0). Until now ESO authenticated to
Vault with login -> lookup-self -> revoke-self on *every* secret fetch.
Across 92 ExternalSecrets refreshing every 15m that measured ~0.22
logins/s + ~0.22 revoke-self/s on the active Vault member, and each
cycle is a token create+revoke (plus its lease) written to the Raft log
on all three members. Those fsync-heavy writes land on the contended
PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the
recurring control-plane flaps (code-oflt write-reduction).

The eso kubernetes-auth role already issues a 240h periodic, unlimited-
use token, so the churn was pure waste: ESO discarded a perfectly good
token after a single use. With token caching ESO mints one token and
reuses/renews it, collapsing logins from ~13/min to a handful per token
lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets
Ready (the one failure, instagram-poster, is pre-existing data drift
unrelated to auth), logins dropped to ~0 after warm-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:32:37 +00:00
Viktor Barzin
bc626a2d89 rightsize: raise OOM-tight memory limits (batch 3/N — spike protection)
Some checks failed
ci/woodpecker/push/default Pipeline failed
shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:28:11 +00:00
Viktor Barzin
418d1efb4b rightsize: trim over-provisioned memory (batch 2/N)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:27:17 +00:00
Viktor Barzin
c3553731c7 dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd
Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922).
- archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty
  (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE =
  ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on
  (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged.
- commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for
  every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms
  added latency under concurrency.
- wal_compression pglz->zstd: ~30-50% smaller full-page images.
All sighup-reloadable. Applied via targeted apply of
module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing
mysql VCT drift that breaks broad dbaas applies.

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:16:38 +00:00
Viktor Barzin
5d059786a1 rightsize: trim over-provisioned memory limits+requests (batch 1/N)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests.

Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 14:46:58 +00:00
Viktor Barzin
256122ff5b monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:34:01 +00:00
Viktor Barzin
c0e0911afa dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply
a2c8f906 added checkpoint_timeout=15min + max/min_wal_size to the CNPG
Cluster YAML, but the cluster is applied via null_resource.pg_cluster +
local-exec kubectl apply, which only re-runs when its `triggers` change.
The YAML edit didn't bump a trigger, so the change was inert and never
applied (incl. via CI). Bump the pg_params trigger so the kubectl apply
re-runs and CNPG hot-reloads the new params (reloadable, no restart).

Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid
3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone
volumeClaimTemplate annotation diff the apiserver rejects as immutable,
which is what fails broad dbaas applies (and silently blocked a2c8f906).

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:25:37 +00:00
Viktor Barzin
a2c8f906ec dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.

Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.

Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).

Refs: code-oflt (etcd/sdc IO isolation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 11:41:09 +00:00
Viktor Barzin
3398873a16 k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.

- var.schedule:        0 23 * * *  ->  0 23 * * 0   (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * *   ->  7 6 * * 1    (report: Monday 06:07 UTC, ~7h
  after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
  valid AND still flags a missed weekly run; no STALE_SECONDS change needed)

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 06:22:20 +00:00
Viktor Barzin
e43e64c666 kyverno: disable reports-controller to stop etcd ephemeralreport load
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd
writes if etcd moves there. Investigation found the avoidable load is kyverno
reporting: the 2026-06-12 etcd-load-reduction disabled the report *features*
but left the reports-controller running (default --enableReporting +
--validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade
left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in
etcd) that nothing reaps (aggregation off). Listing that range starves etcd's
fdatasync enough to flap the apiserver (observed live 2026-06-28).

Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.

Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 05:35:36 +00:00
Viktor Barzin
cf42042cba monitoring: re-trigger apply to persist state after CI cancel-race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:58:49 +00:00
Viktor Barzin
f92075b7c5 fire-planner: solve FIRE targets to age 100 (horizon 60→72)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:49:20 +00:00
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
6f042ee239 fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation
Some checks failed
ci/woodpecker/push/default Pipeline failed
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.

Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
  match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
  rotation so the provisioned datasource never goes stale

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:14:42 +00:00
Viktor Barzin
35c0057d83 chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:39:17 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
50077b43d4 paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.

The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:06:46 +00:00
Viktor Barzin
8236ae309d postiz: reconcile HCL to live (adopt unmerged stack config), keep parked
All checks were successful
ci/woodpecker/push/default Pipeline was successful
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).

Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta-
blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"),
which is why it was parked; IG runs via the instagram-poster service. To
revive later: flip postiz `replicaCount` + temporal `replicas` back to 1
and re-check image pins.

Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
  live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
  caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
  into state (adopted, not recreated — no duplicate AK objects); the
  obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
  synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
  deliberately NOT included here — deferred, postiz uses a static
  secret/postiz database_url.

State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:54:59 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
4fc09b7a61 Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Build Custom Authentik Image / build (push) Has been cancelled
2026-06-28 11:53:04 +00:00
Viktor Barzin
916516eeab authentik overlay patch3: SFE for ALL old iOS browsers + social-login links
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):

1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
   not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
   a non-Safari UA family, so the Safari-only check missed them and they still got
   the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.

2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
   /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
   architecturally can't render Identification-stage sources (authentik docs), and
   emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
   SFE's username/password form was a dead end. The links are plain redirects that
   work on any browser. Slugs are static; re-verify on source changes.

Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:03 +00:00
Viktor Barzin
08bdf32aa0 feat(fire-planner): FIRE Countdown dashboard section + monthly target solve
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.

Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.

- wealth.json: new country / with_home / savings_per_year template vars + a
  per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
  projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
  £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
  fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
  UTC) runs `recompute-fire-targets --countries all`.

Targets come from the solver shipped in fire-planner edb4d11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:52:17 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
f10bb71562 authentik overlay: serve the no-JS SFE login to old Safari (patch #2)
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).

Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.

Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:38:05 +00:00
Viktor Barzin
87a450e9a3 vault: grant emo full read/write on his own secret/emo tree
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.

Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:07:22 +00:00
Viktor Barzin
a1cf7ccaf6 authentik: repoint to the SLOW-1a overlay image + un-enroll Keel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.

Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:46:21 +00:00
Viktor Barzin
7ec64ed5ff authentik: custom-image overlay to fix the 1.4s login-flow query (SLOW-1a)
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
The login flow's identification stage runs a bare select_subclasses() that
LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login
(verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI
login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim),
via django-model-utils string accessors so no import is needed. Byte-identical
output, ~100x faster, robust to adding new login source types.

Shipped as a thin overlay over the official image (mirrors the diun/excalidraw
precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4
+ a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/
viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze
land in a follow-up commit once the image is built. Upstream bug still present in
main (no fix/PR) — drop this overlay once upstream narrows the query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:42:58 +00:00
Viktor Barzin
eebb6c8594 k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case
The nightly upgrade chain detected 1.36, the preflight compat-gate refused it,
and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY
night — even though the block is unactionable (no kyverno/ESO release supports
1.36 yet, and gpu-operator is pinned to its current version because bumping it
needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor
asked to teach the checker to tell 'we can fix this' apart from 'nothing to do
but wait', and stop the nightly Failed-Job + alert noise for the latter.

compat-gate.py now classifies each blocker:
  - ACTIONABLE: a newer addon version in addon-compat.json supports the target
    -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the
    nightly report).
  - WAITING-on-upstream: no released version supports the target yet -> held.
  - PINNED: addon marked pinned in the matrix (gpu-operator) -> held.
Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert.

Tidy the block path (Viktor's scope choice): deliberate gate decisions now make
the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a
non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete
'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge
is pushed definitively once per run (no 1->0->1 flap that re-notifies). The
detector re-spawns a refused-but-Complete preflight nightly (silently) so a
standing hold still re-evaluates, and only announces genuine new/Failed spawns.

nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class.
gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason).

Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned;
Calico the lone actionable piece) — no nightly Failed Job, no alert, just the
morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:08:20 +00:00
Viktor Barzin
b3c419e108 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 09:55:25 +00:00
Viktor Barzin
a3eb309e26 calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:28 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
8d1d2fb999 calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:31 +00:00
Viktor Barzin
3cc8f9f661 paperless-ngx: keep mem limit at 8Gi (tier LimitRange caps containers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The prior commit set the limit to 10Gi, but the shared tier-defaults
LimitRange caps per-container memory at 8Gi, so the rollout's new pod was
forbidden (FailedCreate) and paperless was briefly down. 8Gi is ample for
6 workers anyway (4 workers measured ~1.3Gi under full OCR load). Restored
service live via kubectl patch; this commit matches TF to the live 8Gi so
drift detection won't re-revert it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:37:59 +00:00
Viktor Barzin
21d20dccf8 paperless-ngx: bulk-import via PVC consume dir (restart-safe) + 6 workers
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-document import was going through the API upload path, which
stages each file on the pod's EPHEMERAL scratch before queuing it. Any
paperless pod or redis restart therefore destroyed all in-flight work
(the "File not found" failures we hit) and required manual re-uploads.

Move bulk ingest to paperless's consume directory placed on the encrypted
PVC, with PAPERLESS_CONSUMER_POLLING so the whole folder is re-scanned
periodically (and on startup) with a file-stability check. Files now live
on durable storage and survive any restart — the folder is the queue and
self-heals, so we can copy everything in fast and let it process over
time with zero retry/integrity risk. RECURSIVE preserves the source tree
(avoids basename collisions); owner+tag come from a consumption workflow.

Bump TASK_WORKERS 4->6 to speed the OCR/convert-bound processing (node6
has the core headroom for one pod) and mem limit 8->10Gi for the extra
workers. Revert workers/mem/consume envs to defaults once the import ends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 19:35:10 +00:00
Viktor Barzin
2cb37d51d4 paperless-ngx: scale Gotenberg x3 + Tika x2, 4 workers, skip-archive — speed the Emo import
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Bottleneck found: single Gotenberg 503s under concurrent workers (office docs
failing + slow). Cluster is otherwise idle (sdc 0.5% util, etcd ~1/min), so:
- Gotenberg 1->3 + Tika 1->2 (Service load-balances; fixes the 503s, parallel
  office conversion).
- paperless TASK_WORKERS 2->4, THREADS_PER_WORKER 2->1, mem limit 4->8Gi (avoid
  OOM with 4 concurrent OCR). Requests kept low to stay within tier-quota
  (requests.memory 3840/4096Mi).
- PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text: skip redundant archive for born-
  digital/office docs (big IO saver for the work-doc set).
Guard + etcd watch stay in place; revert to defaults after the import.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 18:45:25 +00:00
Viktor Barzin
d6bd9486e3 Merge remote-tracking branch 'origin/master' into wizard/portal-onboarding-paths
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build k8s-portal / build (push) Has been cancelled
2026-06-27 16:34:44 +00:00
Viktor Barzin
fca948a23d k8s-portal: document all three cluster-access paths in onboarding
The Getting Started portal only walked through the heaviest path (local VPN + kubectl + Vault + sops install) and never mentioned the two zero-setup routes that users actually reach first. Restructure onboarding to lead with all three, recommendation first: (A) the t3 web terminal, which drops you into a ready shell with kubectl/Vault/repos preinstalled; (B) the k8s web dashboard, auto-authenticated per user; and (C) the existing own-machine setup. Flag the dashboard/terminal as the fallback when CLI OIDC login is unavailable, reframe the misleading home-page 'VPN required' banner (only path C needs it), add the access endpoints to the service catalog, and fix a stale Vaultwarden URL (was vault.viktorbarzin.me, which is actually HashiCorp Vault; Vaultwarden is vaultwarden.viktorbarzin.me).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:34:36 +00:00
Viktor Barzin
9599beadc9 paperless-ngx: 2 task workers + 2 threads/worker + 4Gi limit for the Emo bulk import
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Emo's ~13.7k-doc import is OCR-bound on a single celery worker (~10s/doc =
multi-day). Bump PAPERLESS_TASK_WORKERS=2 + THREADS_PER_WORKER=2 for ~2x
throughput, and the memory limit 2Gi->4Gi to fit two concurrent OCR jobs.
Kept deliberately modest: archive writes hit the shared sdc HDD that etcd
also lives on (IO-storm risk, code-oflt) — watch etcd apply latency and
revert workers to 1 if it degrades. Revert to defaults once the import done.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 16:33:43 +00:00
Viktor Barzin
c13a3f1694 plotting-book: pull image from private ghcr instead of public DockerHub
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:

- stacks/plotting-book: point the deployment baseline image at the ghcr
  package and add imagePullSecrets {ghcr-credentials} so the pod can pull
  the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
  allowlist so the Kyverno generate policy clones the pull secret into it.
  Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
  can read the private package before wiring this.

Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:32:19 +00:00
Viktor Barzin
5b49634fe0 rybbit/crowdsec-cf-sync: stop Cloudflare Lists-API retry-storm (429 self-DoS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The edge-ban sync was failing every 2 min on Cloudflare HTTP 429
(rate-limited) and never recovering, leaving the crowdsec_ban list empty.

Root cause: backoff_limit=2 made k8s re-run a failing pod up to 3x within
seconds, so each */2 cycle fired a burst of POSTs into Cloudflare's
per-60s Lists-API write limit. That kept the throttle perpetually tripped
(it stopped clearing even after minutes of quiet) — a self-inflicted DoS.

Two changes make the sync gentle and self-healing:
- backoff_limit 2 -> 0: one attempt per */2 cycle (the schedule IS the
  retry cadence), no rapid-fire burst.
- lapi_kv_sync.py: treat a CF 429 as a soft-skip (exit 0, retry next
  cycle) like the existing LAPI fail-safe, instead of fail-loud + k8s
  retry. Any other CF error still fails loud.

Found during a cluster health check (AIOStreams CSI + pfSense SSH issues
handled separately).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:23:42 +00:00
Viktor Barzin
f92ab04dae vault: grant emo read-only access to his own secret/emo
emo (power-user tier) had no Vault policy granting his personal secret
path, so `vault kv get secret/emo` failed. Viktor asked to give him that
access. Adds a read-only `personal-emo` policy (read on secret/data/emo +
metadata) and attaches it to emo's OIDC identity by adopting the
entity/alias Vault auto-created on his first login. Scoped explicitly to
emo; does not widen the power-user tier (which stays secret-less).

Verified live: a personal-emo token reads secret/emo, is denied writes,
and is denied other paths (secret/viktor -> 403).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:35:57 +00:00
a7117e0bfe immich(frame-emo): bump photo-frame Interval 30->45s
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Permissions-test change requested by Viktor: slow Emo's Sofia photo-frame
slideshow from 30s to 45s per image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 13:07:00 +00:00
Viktor Barzin
d50962b00e immich: add Immich photo-frame for Emo's Portal (highlights-immich-emo)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Second ImmichFrame instance cloned from the London frame (frame.tf), scoped to Emo's Immich account (emil.barzin) with Sofia weather coords and last-2-years photos. Drives Emo's Meta Portal Mini in Sofia via the portal-immich-frame app. Dedicated API key minted on Emo's account and stored in Vault (secret/immich -> frame_api_key_emo).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:40:29 +00:00
Viktor Barzin
e8b72019b5 paperless-ngx: deploy Tika + Gotenberg for Office ingest + raise PVC ceiling to 80Gi
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Emo's import scope now includes his work-PC document set (C/Documents,
Project Management, Service & MRO, etc. on the NAS), which is ~4.9k Office
files (.doc/.docx/.xls/.xlsx/.ppt/.pptx) on top of Emo shared. Paperless
can only archive/OCR/index those if it can convert them, so add the standard
Apache Tika (text+metadata) + Gotenberg (-> PDF) sidecar deployments + their
services in the paperless-ngx namespace and point PAPERLESS_TIKA_* at them.
Pinned images (gotenberg 8.25, tika 3.3.1.0), single replica, no PVC.

Total in-scope document set across all NAS locations is now ~13,700 PDF+Office
files / ~13.7GB source (~30GB once OCR'd + archived), so raise the data PVC
autoresize ceiling 30Gi -> 80Gi for comfortable headroom. The topolvm
autoresizer grows on demand up to the ceiling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 12:02:04 +00:00