Commit graph

228 commits

Author SHA1 Message Date
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
e1bd111562 rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:52:30 +00:00
Viktor Barzin
7dd80b6c7c technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:10:46 +00:00
Viktor Barzin
217a54be9d cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:06:33 +00:00
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
6f03ccd1aa excalidraw: grant emo-browser SA port-forward for drawing uploads
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:

  pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
  in namespace excalidraw

because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.

Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.

Also record the grant in docs/architecture/chrome-service.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:08:28 +00:00
Viktor Barzin
88c86e2109 ci: Slack-notify failed pipeline runs only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor doesn't want a Slack message for every CI run — only failures.
The infra apply pipeline posted a status line to #general on every push,
and the renew-tls / postmortem-todos / registry-config-sync /
pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine
messages a week). Now: the apply pipeline's success post is gone
(notify-failure already covers failures), all cron notifies are
status:[failure] with explicit FAILED texts, and drift-detection is
silent when all stacks are clean (still posts drift findings and errors,
and gains a hard-failure catch step it previously lacked). Kept:
notify-nonadmin-push (org audit feed) and the actionable provision-user
post. Per-app deploy template in ci-cd.md updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:27:43 +00:00
Viktor Barzin
a64d2ba2b9 upgrades: fix hourly gotenberg error + cap update notifications at weekly
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor was getting upgrade-error Slack messages every hour and wants
update notifications at most weekly. Root cause of the errors: Keel kept
trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's
require-trusted-registries denied it — gotenberg/* (and apache/*, which
tika will hit next) were never allowlisted, and Keel's Slack notifier at
info level re-posted the identical failure to #general on every hourly
poll since Jun 28.

Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly;
disable Keel's direct Slack notifier and replace failure visibility with
a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification
plus the daily digest, never an hourly drip); remove diun's Slack
notifier whose default message @channel-pinged #image-updates for every
new upstream tag every 6h (the n8n upgrade-agent webhook feed is
untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC).
Paperless-ngx itself stays paused (keel policy=never, user-managed) while
the ingest runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:16:50 +00:00
Viktor Barzin
308a174ad6 docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:42:27 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
250d0fc334 docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.

Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:24:40 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
ec681ba6e1 ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (catches ~0 of the real, runtime/Vault-data/SSA/lock
failures; reproduced `terraform validate` passing the exact stacks that
fail at apply) and lock-reaping/force-unlock (PG advisory locks are
session-scoped + auto-release; force-unlock can't free them and would
corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
0097bddf9f docs(ci-cd): add plotting-book build→ghcr→deploy flow diagram
ASCII flow of the migrated plotting-book pipeline (GHA build in Anca's
repo → private ghcr.io/passionprojectsanca/book-plotter → Woodpecker
redeploy hook → in-cluster pull via ghcr-credentials), plus the Kyverno
admission / Keel backstop / Vault pull-cred supporting cast and the
serving path. Appended to the existing plotting-book section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:49:58 +00:00
Viktor Barzin
bbc797b30e ci(woodpecker): stop applying/planning the Tier-0 vault stack in CI
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The nightly drift-detection cron and every vault-touching push apply have
been failing because CI runs terragrunt plan/apply on the Tier-0 `vault`
stack, which manages Vault's own transit mount + ACL policies. The CI
`ci` Vault role intentionally lacks those admin perms (sys/mounts,
sys/policies/acl), so the run always errors:
  - apply: 403 on vault_mount.transit + vault_policy.personal_emo, plus an
    Invalid for_each (local.k8s_users from secret/platform is deferred)
  - drift: terragrunt plan exits 1 → fails the whole nightly run

vault is Tier-0 = human-applied via OIDC. Skip it in both pipelines:
- default.yml: skip `vault` in the platform-apply loop (kept in
  PLATFORM_STACKS so the app-stack detector still excludes it)
- drift-detection.yml: skip `vault` in the per-stack plan loop
- ci-cd.md: document the exclusion on both pipeline rows

Found during a CI-health sweep (user reported many failures): GitHub
Actions all green; all Woodpecker repos green except this recurring
infra-repo failure, doubled by the legacy repo-1 + repo-82 dual
registration.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:48:41 +00:00
Viktor Barzin
c13a3f1694 plotting-book: pull image from private ghcr instead of public DockerHub
Anca's plotting-book app now builds its image in her own GitHub repo to
the private package ghcr.io/passionprojectsanca/book-plotter (off public
DockerHub viktorbarzin/book-plotter). Wire the cluster to pull it:

- stacks/plotting-book: point the deployment baseline image at the ghcr
  package and add imagePullSecrets {ghcr-credentials} so the pod can pull
  the private image (the live tag is still CI-owned via ignore_changes).
- stacks/kyverno: add the plotting-book namespace to the ghcr-credentials
  allowlist so the Kyverno generate policy clones the pull secret into it.
  Verified the shared ghcr_pull_token (Viktor, repo-admin on Anca's repo)
  can read the private package before wiring this.

Docs: correct ci-cd.md (it wrongly listed plotting-book as already on
ghcr — it was on DockerHub) and note the special arrangement; amend
ADR-0003 to record that this GitHub-first repo builds to its own org's
ghcr namespace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:32:19 +00:00
Viktor Barzin
bf40409141 docs(security): note crowdsec-cf-sync rate-limit resilience
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the backoff_limit=0 + CF-429 soft-skip hardening alongside the
cf-sync architecture description, with the why (the backoff_limit=2
retry-storm that escalated Cloudflare's Lists-API throttle into a stuck
state). Follow-up to 5b49634f.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 15:27:44 +00:00
Viktor Barzin
19d0f0933a chrome-service: supervise x11vnc in noVNC sidecar so the VNC view self-heals
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view at chrome.viktorbarzin.me went black: x11vnc (in the novnc
sidecar) attaches to the browser container's Xvfb over localhost:6099, and when
that container restarted (~8h ago, Chrome exited cleanly) x11vnc lost its X
connection and exited. Because the entrypoint ran x11vnc as an unsupervised
background child and then exec'd websockify as PID 1, the dead x11vnc was never
relaunched — :5900 stayed dead (a defunct zombie), websockify kept returning
'Connection refused', and the view was black until a manual pod restart.

Fix: the entrypoint now runs both x11vnc and websockify as supervised background
children and exits non-zero via 'wait -n' if either dies, so the kubelet restarts
the novnc container, which re-waits for Xvfb and relaunches x11vnc. The bridge
now self-heals across browser-container restarts. Mirrors the android-emulator
stack's supervision pattern. Architecture doc updated with the new failure mode,
diagnosis, immediate-recovery, and SHA-pin deploy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:03:29 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
  was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
6f1951af93 fix(workstation): carry OS/sudo authz policy into managed-settings source + multi-tenancy doc
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0015's policy change was applied live to /etc/claude-code/managed-settings.json, but that file self-deploys from the repo source scripts/workstation/managed-settings.json via the hourly reconcile (sync_managed_config). Without updating the source the next reconcile would REVERT /etc to the old 'never read other homes' rule. This updates the source-of-truth claudeMd (now byte-identical to /etc) so the change is durable + canonical, and refresh_codex_mirror propagates it to every user's ~/.codex/AGENTS.md. Also notes the access-model change in the multi-tenancy architecture doc (pointer to ADR-0015).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 08:25:33 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
e88ea50304 docs(multi-tenancy): document install_skills (vendored per-user agent skills)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Record the new reconcile step alongside install_memory/install_playwright:
vendored own-copies of the 16-skill set for the SKILL_USERS allowlist (emo),
why it's vendored not npx (upstream drift), and that if-absent keys on the
user's own copy so it heals a stale/cross-user ~/.claude/skills symlink
(emo's grill-me pointed into the admin's home).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:30:27 +00:00
Viktor Barzin
59f2beda21 chrome-service: run real Google Chrome (H.264/AAC codecs) for the browser
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Point the chrome-service container at the new chrome-service-browser image and
launch /opt/google/chrome/chrome instead of the bundled Chromium. Fixes
MEDIA_ERR_SRC_NOT_SUPPORTED on H.264/AAC video (Instagram Reels etc.) in the
noVNC view — bundled Chromium has those codecs compiled out; only real Chrome
carries them. connect_over_cdp callers (tripit fare scrape, homelab browser,
snapshot-harvester) attach over raw CDP (version-tolerant) — validated after
rollout. Image is built off-infra on GHA (prior commit) → public ghcr.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 21:15:36 +00:00
Viktor Barzin
a3cdc0d6d0 chrome-service: size headed Chrome window to fill Xvfb (noVNC cut-off)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC view showed the browser in the top-left with the rest of the
framebuffer black. Cause: Chrome launched with no --window-size, and there's no
window manager, so it opened at its profile-persisted (smaller) size inside the
1280x720 Xvfb. Add --window-size=1280,720 --window-position=0,0 so the window
fills the screen on every launch (fresh pods/profiles too). Live windows were
already resized via CDP as a stopgap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 18:00:20 +00:00
Viktor Barzin
c7ead032ec chrome-service: fix noVNC stuck-"Connecting" (x11vnc fd-sweep under nofile=2^31)
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build chrome-service-novnc / build (push) Has been cancelled
The noVNC view hung on "Connecting" forever then timed out. Root cause: x11vnc
sweeps the entire fd table (fcntl per fd) on every client connection, and
containerd grants pods RLIMIT_NOFILE=2^31, so the RFB handshake never completes
(websockify accepts the WS and dials localhost:5900, but x11vnc never sends its
banner — verified: handshake timed out at 8s, x11vnc had burned 1h41m CPU
spinning). Same bug + fix the android-emulator stack already carries.

Cap nofile before x11vnc starts, in two places:
- files/novnc/entrypoint.sh: `ulimit -n 65536` (root fix, makes the image correct)
- main.tf novnc container: `command = ["bash","-c","ulimit -n 65536; exec /entrypoint.sh"]`
  so the cap applies deterministically on rollout even though the image is
  :latest/IfNotPresent (a rebuilt entrypoint isn't guaranteed to be re-pulled).

Also documents the gotcha + diagnosis in docs/architecture/chrome-service.md and
notes the black-when-idle behaviour + the autoconnect URL.

(A live x11vnc relaunch with the cap already unblocked the running pod; this
makes it survive restarts.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 17:34:03 +00:00
Viktor Barzin
7dbbb74163 homelab v0.8.1: frame browser as escalation (default headless), match CLAUDE.md
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
Make `homelab browser --help` and chrome-service.md state the same tiered rule
now in ~/code/CLAUDE.md: default to the Playwright MCP/headless browser for all
routine automation; reach for `homelab browser` ONLY when headless is blocked
(loads-but-submit-fails / one request errors while siblings 200 / explicit bot
wall). Removes the "co-equal choice" framing so agents have one non-conflicting
instruction. Adds a test asserting the tiered wording so it can't regress.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 15:44:43 +00:00
Viktor Barzin
a6b52a5839 homelab v0.8.0: browser verbs for headful anti-bot web automation
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Add `homelab browser run|open` so agents can drive the cluster's headful
Chrome (chrome-service) over CDP from the devvm. The headless playwright/mcp
browser can load anti-bot sites and fill their forms, but the gated submit
silently fails — e.g. the Stirling Ackroyd Fixflo tenant portal returned
net::ERR_FILE_NOT_FOUND on its pre-submit check and hung, creating nothing.
Driving the real headful Chrome submits first try. That capability already
existed but was undiscoverable, so it cost ~40 min + redundant form re-runs to
find; now it is one command, versioned, test-covered, and `browser --help`
carries the when-to-use signature + an error-code cheat-sheet so the right tool
is reached at the right moment (the failure was judgment, not setup).

- port-forward svc/chrome-service:9222 (tunnels API-server->pod, so it bypasses
  the :9222 NetworkPolicy), assert non-headless via /json/version,
  connect_over_cdp, inject the same vendored stealth.js the in-cluster callers
  use; the port-forward is always torn down, on success and on error.
- node CDP client pinned to playwright-core@1.48.2 to match the v1.48.0-noble
  image (Chromium 130); self-provisioned lazily into ~/.cache/homelab, no
  per-user setup.
- default is a fresh incognito context (safe for the shared browser + concurrent
  callers); --shared-context reuses the warmed persistent profile.
- TDD: cmd_browser_test.go covers arg parsing, headless detection, the version
  pin, the help cheat-sheet, and a stealth.js drift guard. Verified end-to-end
  against bot.sannysoft.com (real Chrome UA, webdriver hidden, plugins/WebGL
  spoofed) and `browser open`.
- docs: README v0.8 section, ADR-0013, and a chrome-service.md "driving from
  outside the cluster" section.

Closes: code-nepg

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 12:22:22 +00:00
Viktor Barzin
a0897de7c3 workstation: document homelab-memory hooks + provisioner self-deploy [ci skip]
multi-tenancy.md never mentioned the homelab-memory hooks rollout and still
listed claude_memory credential injection as purely "future". Document what is
actually true now: install_memory provisions the recall/auto-learn/compaction
hooks per user, the provisioner binary self-deploys from the repo (step 0), the
set -e abort fix, and that the hooks no-op without a MEMORY_API_KEY in env (CLI
defaults the URL) — emo has a key, ancamilea is keyless until one is minted.
Also clarify setup-devvm.sh's binary install is now bootstrap-only (ongoing
edits self-deploy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 08:04:38 +00:00
Viktor Barzin
ead876ec65 k8s-upgrade: nightly Slack report monitor + scope chain-failed alert to phases
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").

Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.

What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
  running version, detector freshness, detected target, outcome (no-op /
  blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
  Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
  for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
  Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
  v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
  `k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
  (or any future helper) can't false-trip the chain-wedged alarm.

Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:57:44 +00:00
Viktor Barzin
7270e2be3b monitoring: K8sUpgradeChainJobFailed must not double-fire on a compat-gate block
Some checks failed
ci/woodpecker/push/default Pipeline failed
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.

Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 16:35:35 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
bc2fbc712c Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew 2026-06-20 20:10:48 +00:00
Viktor Barzin
5549fc3672 Add per-user Claude auth renewal
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
2026-06-20 20:10:40 +00:00
Viktor Barzin
3278588325 chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 20:04:24 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
ed53b34bf4 k8s-version-upgrade: dynamic worker enumeration + IP-based SSH (auto-cover all/new nodes)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The chain hardcoded master→node4→node3→node2→node1→postflight and SSHed by
FQDN. It silently SKIPPED node5/node6 (added 2026-05-26) — postflight would
have failed even if reachable — and node5/node6 had no .viktorbarzin.lan DNS
records, so the chain couldn't SSH to them at all.

Refactor (upgrade-step.sh):
  - Worker set + order derived live from `kubectl get nodes` (worker_nodes /
    next_pending_worker), so EVERY worker still off-target is upgraded and a
    newly-joined node is covered with zero script change.
  - SSH targets are node InternalIPs (ssh_target), removing the dependency on
    node DNS records entirely — a new node is reachable the moment it joins.
  - The two remaining hardcoded loops (containerd skew, apt-repo rewrite) now
    enumerate workers/all-nodes dynamically too.
  - Topology preserved: master-drain Job runs on the first worker; every
    worker-drain Job runs on the already-upgraded k8s-master (self-preemption
    invariant intact).
  - next_pending_worker returns 0 explicitly on the no-match path — the
    `while read … done < <(…)` loop exits 1 at EOF, which under set -e would
    abort the LAST worker's Job before it spawns postflight (cluster upgraded
    but no cleanup / in_flight reset). Caught in review.

Docs (runbook + architecture + headers) updated to the dynamic topology.

NOTE: nodes still need the k8s-upgrade SSH public key in authorized_keys; it was
deployed to node4/5/6 by hand this session. Baking it into node provisioning
(so new nodes get it automatically) is the remaining follow-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 16:56:02 +00:00
Viktor Barzin
037a609f27 k8s-version-upgrade: unblock 1.34.9 — skip kubeadm CoreDNS addon + busybox-date fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The 1.34.9 master upgrade hard-failed `kubeadm upgrade apply` preflight: CoreDNS
is at v1.12.4 (Keel auto-bumped it 1.12.1 -> 1.12.4 on 2026-05-26 via a stale
kube-system out-of-band annotation), and 1.12.4 is ahead of kubeadm 1.34.9's
bundled corefile-migration table ("start version not supported").

- scripts/update_k8s.sh: master `kubeadm upgrade apply` now runs with
  `--ignore-preflight-errors=CoreDNSMigration,CoreDNSUnsupportedPlugins
  --skip-phases=addon/coredns`. A dry-run proved --ignore ALONE would overwrite
  our custom split-horizon Corefile with kubeadm's default AND downgrade the
  image; --skip-phases leaves CoreDNS 100% untouched while the control plane
  upgrades. CoreDNS is pinned off Keel (keel.sh/policy=never) to stop the drift.
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh: fix the preflight
  quiet-baseline (settle-window) check, which silently no-op'd on the ghcr
  claude-agent-service image's busybox `date` (can't parse ISO8601). Now tries
  GNU then busybox `-D`, and warns+skips on parse failure (no silent fail-open).
- docs: runbook + architecture document the CoreDNS handling.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:45:05 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00