Commit graph

350 commits

Author SHA1 Message Date
Viktor Barzin
50778d47d3 drone-logbook: new stack — self-hosted Open DroneLog at dronelog.viktorbarzin.me
Viktor asked to self-host the DJI flight-log analyzer for his DJI Mini 4 Pro
(his fork ViktorBarzin/drone-logbook -> upstream arpanghosh8453/open-dronelog).
Upstream ghcr image with Keel auto-upgrade, DuckDB data on an encrypted
proxmox-lvm PVC (GPS traces = sensitive), NFS /sync-logs drop folder imported
every 8h, daily backup CronJob to /srv/nfs/drone-logbook-backup (vaultwarden
pattern), Authentik-gated ingress, PROFILE_CREATION_PASS from Vault via ESO.
Design + plan in docs/plans/; service-catalog updated.
2026-07-04 08:42:53 +00:00
Viktor Barzin
6698018ab6 service-catalog: add tasks row + tasks to the proxied-domains list
Some checks failed
ci/woodpecker/push/default Pipeline failed
Docs-with-change convention: the new tasks stack (Reminders-style PWA over
Nextcloud CalDAV) gets its catalog entry — what it is, its CNPG db + Vault
static role, the auth=required/X-authentik-username trust model with the
SEC-1 NetworkPolicy, and the ADR-0002 CI/CD path — and tasks joins the
Cloudflare proxied hostname list.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:42 +00:00
Viktor Barzin
21c6e7112e stem95su: retire the in-cluster serving stack — now a Valia site on Pages
Completes the ADR-0018 cutover. The stack is emptied to a tombstone so
CI destroys nginx, the NFS content volume, the ingress, the per-site
gdrive-sync CronJob and the namespace; serving + sync are owned by
stacks/valia-sites since the cutover commits. Catalog + runbook updated
to the migrated state (incl. the one-time 42.9→21.4MB video compression
Viktor approved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:22:32 +00:00
Viktor Barzin
77fcb08e8e mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor asked to forward arbitrary emails with PDF attachments into
paperless-ngx, with the forwarding sender mapping 1:1 to the paperless
account that owns the document. paperless-ngx's built-in IMAP consumer
already does the sender->owner mapping, so the infra half is a dedicated
real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain
catch-all would otherwise divert it into the TripIt-swept spam@ mailbox,
whose sweeper LLM-parses and auto-replies to mail from linked senders)
plus a per-user Dovecot sieve that discards non-family senders at
delivery (chosen behaviour for unmatched senders: ignore and delete;
also keeps spam out of the guessable address). The mailbox credential
was added to Vault secret/platform.mailserver_accounts. Paperless-side
mail account + 5 per-sender rules are DB state, configured via the API
per the new runbook docs/runbooks/paperless-mail-ingest.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:06:19 +00:00
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
684ca4527c docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:20:06 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
339f5d89b9 onlyoffice: decommission (stack destroyed, dir removed)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The document server had been deliberately scaled to 0/0 for 184 days, but
its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down'
showed up in every daily alert digest. Viktor approved tearing it down.
terragrunt destroy ran clean (11 resources) before this commit; the kuma
monitors auto-prune with the ingress. Also drops the onlyoffice/* image
prefix from the kyverno trusted-registries allowlist, the service-catalog
rows, and updates the nextcloud collabora comment. Document data (if any)
remains on the PVE NFS share.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-01 22:35:22 +00:00
Viktor Barzin
82371d1ef8 dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.

Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:47:09 +00:00
Viktor Barzin
71501be408 nodes: journald -> volatile (RAM) to cut sdc write-IOPS
Some checks failed
ci/woodpecker/push/default Pipeline failed
Node "container churn" investigation (code-oflt): container logs (~30 KB/s)
and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4
journal (jbd2) metadata writes driven mostly by journald's continuous
appends. node4 + node5 had drifted to uncapped persistent journald (4 GB
each, ~100 KB/s); master/node1-3 were correctly capped at 500M.

Node + pod journals already ship to Loki (alloy loki.source.journal), so
on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch
journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide:
- cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces
  the old persistent seds).
- running nodes (master + node1-5): pushed the same drop-in via qm guest
  exec + journald restart + cleared /var/log/journal.

Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal
gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in
Loki. Trade-off: a hard crash loses the last unshipped journal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:15:38 +00:00
Viktor Barzin
1afe41880e docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
  webcal calendar churn that was ~60% of MySQL's writes (now throttled
  in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
  longer needs -target dodging (now ignore_changes'd on the STS VCT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:56:04 +00:00
Viktor Barzin
a3f2c2947a docs: refresh CNPG tuning note (archive_timeout=0, commit_delay, zstd) + apply gotcha
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflects the write-reduction params applied in c3553731, and documents the
null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't
hit the inert-change / mysql-VCT-drift traps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:17:38 +00:00
Viktor Barzin
4473b469e3 lvm-pvc-snapshot: cut retention 7->3 days (reduce sdc thin-pool CoW IOPS + free ~1TB)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Halving to 3
days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.

Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:59:16 +00:00
Viktor Barzin
bebe8fbd74 workflows: add read-only memory-overcommit + node-removal capacity analysis
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.

Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.

Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:06:17 +00:00
Viktor Barzin
a2c8f906ec dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.

Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.

Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).

Refs: code-oflt (etcd/sdc IO isolation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 11:41:09 +00:00
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
a1cf7ccaf6 authentik: repoint to the SLOW-1a overlay image + un-enroll Keel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.

Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 10:46:21 +00:00
Viktor Barzin
b3c419e108 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 09:55:25 +00:00
Viktor Barzin
a3eb309e26 calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.

Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.

Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:32:28 +00:00
Viktor Barzin
385dfff0e7 authentik: fix episodic blank-screen + 30s-hang login (reliability R2)
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)

Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
  reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
  and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
  workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
  from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
  blank screen.

Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:17:05 +00:00
Viktor Barzin
b84b0021c2 authentik: dedicated rate-limit carve-out + per-router 5xx observability
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).

- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
  mirroring the existing health/tripit carve-outs). The authentik / and /static
  ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
  `traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
  per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
  Narrowed it to keep the counter while still dropping the high-cardinality
  `*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
  for the episodic all-3-server-pods-NotReady 502/503/504 cascade.

Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 09:10:34 +00:00
Viktor Barzin
8d1d2fb999 calico: add whisker-watchdog CronJob to self-heal a wedged whisker-backend
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)

Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
  (calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10 goldmane-
  connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
  avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
  and does not restart.

Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 08:59:31 +00:00
Viktor Barzin
006f97ef58 docs: bless local terragrunt apply, but require committing every applied change
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.

- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
  apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
  (not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
  requirement that every applied change is committed + pushed to master the same
  session so the repo stays the source of truth and CI drift-detection doesn't
  revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
  an option alongside CI auto-apply, with the same 'always committed, never
  applied uncommitted' rule.

Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 08:10:20 +00:00
Viktor Barzin
fd33d1a447 monitoring: consolidate all Slack alerting to #alerts, abandon #security
Some checks are pending
ci/woodpecker/push/default Pipeline is running
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.

Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
  title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; value
  was already switched and applied last change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
  .claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
  "invite the app / flip back to #security" caveats and state the
  #security abandonment + #alerts consolidation as the current routing.

Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 13:29:44 +00:00
Viktor Barzin
6c5288998f goldmane-trail: polish follow-ups #57/#59/#61/#62/#63 + digest→#alerts
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):

- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
  auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
  (the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
  (prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
  TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
  update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
  security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
  #62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
  edge table (feeds code-8ywc; enforce-flips out of scope).

Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:49:25 +00:00
Viktor Barzin
c6bba1da6e home-assistant skill: refresh ha-london map (HAOS 2026.5.2, Cowboy revived, Overview redesign)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:03:15 +00:00
Viktor Barzin
7d297dc6b1 eso: complete migration — chart 2.6.0, all CRs on v1, 1.35 gate cleared
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.

Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.

Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-23 09:55:51 +00:00
Viktor Barzin
e2018f9b6c docs: memory via homelab CLI, not the retired memory-tool/MCP
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 17:24:00 +00:00
Viktor Barzin
ceae4d5f06 docs: rewrite CrowdSec enforcement architecture (firewall-bouncer + CF WAF; Yaegi plugin removed)
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:39:26 +00:00
Viktor Barzin
68d9058f85 cleanup: fully remove orphaned council-complaints app
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.

This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
  allowlist comment claimed council-complaints as the last referencer;
  rewrite it (no live workload pulls from that registry now; only stale
  completed Job records still carry the ref). The allowlist line itself
  is kept (registry-scoped, not app-specific).

Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 13:32:10 +00:00
Viktor Barzin
5a136c7d53 docs: t3-migrate-idle runbook section + service-catalog + design status
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 12:40:46 +00:00
Viktor Barzin
7bd4612edf ci: scripts/tg waits out a contended state lock (-lock-timeout)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".

Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.

Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).

Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-21 00:15:39 +00:00
Viktor Barzin
6d5d3726d6 Merge remote-tracking branch 'origin/master' into wizard/ha-cli-verbs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-20 23:46:29 +00:00
Viktor Barzin
48225f2dea homelab CLI v0.7: add ha token + ha ssh for Home Assistant
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.

Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
  live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
  pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
  non-interactive ssh to the HA host using the invoking user's key.

Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:46:09 +00:00
Viktor Barzin
46166c63b2 fix(authentik): long-lived social-login sessions + shield auth from CrowdSec lockout
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)

Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
  and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
  as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
  exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
  ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
  every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
  carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
  make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
  and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
  (previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
  (CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).

Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 23:40:22 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
d709d338c6 service-catalog: add paperless-ai (RAG semantic search + auto-tagging)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 06:44:00 +00:00
Viktor Barzin
c04efa3d3a k8s-version-upgrade: move detection to nightly 23:00 UTC (overnight upgrades)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)

  - stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 → 0 23 * * *.
  - scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
    (was next_daily_noon_utc).
  - docs (runbook, architecture) + upgrade-state SKILL: schedule references
    updated to 23:00 UTC nightly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 18:16:32 +00:00
Viktor Barzin
fb638cd8ec k8s-version-upgrade: scope chain-fail alert to terminal reasons + sync docs
Some checks failed
ci/woodpecker/push/default Pipeline failed
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.

Docs updated to match the behaviour change (per the same-commit docs rule):
  - docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
    "kill a stuck Job" recovery now leads with retry-on-failure self-heal.
  - docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
    retry-on-failure note on the deterministic-naming paragraph.
  - .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
    entry, and drill-down (also copied to the active ~/.claude copy).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 13:10:18 +00:00
Viktor Barzin
cdd9ecd199 t3: docs for the gated nightly tracker (runbook, post-mortem, service-catalog)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
  gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
  gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
  replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
  2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 11:33:49 +00:00
Viktor Barzin
92c5b24975 docs: ghcr_pull_token is now a scoped read:packages PAT, not the admin alias
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in
Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias
of the broad admin github_pat. Propagated via targeted apply of
module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the
allowlisted namespaces). Document the new cred + the manual rotation recipe.

Closes: code-h2il

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 20:19:17 +00:00
Viktor Barzin
0bfa6f0774 feat(anisette): self-hosted Apple anisette server for SideStore (infra #40)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.

- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
  for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
  releases / semver / sha tags), so pinned by manifest digest instead of a tag
  per the "never :latest" rule. Pulled from DockerHub via the registry-VM
  pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
  a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
  download; upstream issue #23 means it doesn't preserve client auth across
  restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
  allow_local_access_only, ssl_redirect off) — SideStore is a native client
  that can't do the Authentik cookie dance, same reasoning as android-emulator's
  adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
  publicly exposed.

Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 19:35:57 +00:00
Viktor Barzin
72982683bc docs(CLAUDE.md): k8s-portal now GHA->ghcr, not a Woodpecker build
All checks were successful
ci/woodpecker/push/default Pipeline was successful
k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml
was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled
via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD
section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from
'already on GHA' to the infra-owned private-ghcr images, and add it to the
PRIVATE ghcr allowlist roster. Completes the no-local-builds migration.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 16:10:56 +00:00
Viktor Barzin
3e82c64a76 docs: sync CI/CD docs to ADR-0002 final state (ghcr + Woodpecker deploy-only) [ci skip]
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.

This brings the docs to the completed reality (closes #33):

- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
  the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
  split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
  Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
  fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
  to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
  Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
  deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
  cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
  CI pipeline). Description/comment text only — no stack logic changed.

Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 12:55:49 +00:00
Viktor Barzin
bb0f9f59ef docs: CI-compute doctrine — leverage external infra for builds AND tests [ci skip]
Viktor's standing instruction (2026-06-12): lean on external infra as
much as possible for CI — builds, running tests, lint, releases all on
GitHub Actions hosted runners, never on cluster nodes; in-cluster
pipelines only for cluster-touching steps (deploys, terragrunt,
certbot). Also: watch any triggered pipeline chain to completion and
fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md
CI sections (ADR-0002 companions).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:39:27 +00:00
Viktor Barzin
97dcf49b8e monitoring: reduce Slack alert noise (alert-on-change + daily digest)
Some checks failed
ci/woodpecker/push/build-cli Pipeline was successful
ci/woodpecker/push/default Pipeline was canceled
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.

Changes (all in the monitoring module):

* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
  once, then only on a membership change or resolve); critical 1h -> 6h
  (a slow nag, not an hourly drip). send_resolved stays on. The bulk of
  the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
  continuously for ~24h, re-notifying every 4h).

* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
  08:00 Europe/London: the full current board grouped by severity + what
  resolved in the last 24h. This is the standing-state safety net for the
  alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
  (no pip/apk at runtime -> none of the per-run disk-write footprint that
  disabled status-page-pusher). Reuses the existing Alertmanager Slack
  webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.

* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
  downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
  PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
  The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
  PodImagePullBackOff uninhibited because only NodeDown was a source.

* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
  for the same leg — two alerts described one condition and were the #1
  noise source (~3,400 alert-minutes over 24h).

* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
  completed CronJob pods that linger in EndpointSlices as NotReady
  addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
  pod with a genuinely broken metrics endpoint still fires.

* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
  NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
  transient Pushgateway/scrape blip no longer fires-and-resolves.

* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
  annotation, so notification volume was unmeasurable — now we can verify
  this change worked (alertmanager_notifications_total et al.).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 20:35:56 +00:00
Viktor Barzin
18f524c265 docs: ghcr-credentials is now Kyverno-synced to allowlisted namespaces [ci skip]
Same-change doc sync for infra#12: the tripit-ns-scoped interim secret
paragraph described the pre-ClusterPolicy state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 20:31:55 +00:00