Commit graph

4732 commits

Author SHA1 Message Date
Viktor Barzin
d9717a53bf vault-token-renew runbook: document the self-heal behavior
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Drift guard section rewritten: admin-capable clobbers now self-heal at the
nightly run (HEALED log line); weak clobbers keep the loud DRIFT failure;
manual re-mint is only the weak-clobber recovery now.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:44 +00:00
Viktor Barzin
4a7b6db806 vault-token-renew: self-heal the periodic token on admin-capable clobber
Viktor asked for 'vault login -method=oidc' to work seamlessly: the OIDC
login the docs prescribe kept clobbering ~/.vault-token with a 7-day token,
and detect-only DRIFT failures went unnoticed for weeks (weekly-expiry
loop, twice in June). On drift the renewer now re-mints the periodic token
with the clobbering token's own authority (Vault's 403 is the judge — no
policy guessing), sanity-checks it, replaces the file atomically, and
revokes stale token-devvm-wizard leftovers. Weak/read-only clobbers still
fail loudly on purpose. Design: docs/plans/2026-07-03-vault-token-self-heal-design.md
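
The "replaces the file atomically" step can be sketched as below. This is an illustrative Python sketch, not the renewer's actual code: the function name `replace_token_file` and the temp-file prefix are assumptions. It writes the new token to a temporary file in the same directory and renames it over the target, so a reader never sees a half-written file.

```python
import os
import tempfile


def replace_token_file(path: str, token: str) -> None:
    """Atomically replace `path` with `token` (hypothetical helper name)."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600 in the target directory, so the
    # final rename stays on one filesystem and is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".vault-token.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a stray temp file next to the real token.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the write fails partway, the old token file is left untouched, which matches the intent that a failed heal must not make things worse.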

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:20:00 +00:00
Viktor Barzin
8631709ca2 vault-token-renew: pure helpers for the self-heal revoke filter
vtr_accessor parses the accessor from lookup JSON; vtr_is_stale_periodic
decides which old token-devvm-wizard tokens a heal may revoke (never the
just-minted one, never foreign tokens, nothing when the keeper is unknown).
TDD red-green for the heal branch that lands next.
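
The revoke filter's three rules can be sketched as pure functions. This is a Python illustration under assumptions, not the repository's helpers: the names `accessor_from_lookup` and `is_stale_periodic` mirror `vtr_accessor` and `vtr_is_stale_periodic` only loosely, and the candidate record layout (`display_name`, `accessor`) is assumed from the shape of Vault's token lookup output.

```python
import json

# Display name of the periodic token, as named in the commit message.
PERIODIC_DISPLAY_NAME = "token-devvm-wizard"


def accessor_from_lookup(lookup_json: str) -> str:
    """Return data.accessor from token-lookup JSON, or "" if it cannot be read."""
    try:
        accessor = json.loads(lookup_json)["data"]["accessor"]
    except (ValueError, KeyError, TypeError):
        return ""
    return accessor if isinstance(accessor, str) else ""


def is_stale_periodic(candidate: dict, keeper_accessor: str) -> bool:
    """Decide whether a heal may revoke `candidate` (assumed record layout)."""
    if not keeper_accessor:
        return False  # keeper unknown: revoke nothing
    if candidate.get("display_name") != PERIODIC_DISPLAY_NAME:
        return False  # never touch foreign tokens
    # Never revoke the token that was just minted.
    return candidate.get("accessor") != keeper_accessor
```

Keeping both functions free of I/O is what makes the red-green test loop cheap: every rule is checkable without a Vault server.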

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:19:09 +00:00
Viktor Barzin
a07a603b80 docs/plans: vault-token self-heal implementation plan
Task-by-task TDD plan for the approved self-heal design: pure-function
tests first, then the heal branch, runbook update, deploy + live clobber
simulation, landing and memory updates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:09:36 +00:00
Viktor Barzin
e2bfb20c84 docs/plans: vault-token self-heal design (devvm renewer)
Viktor asked to make 'vault login -method=oidc' work seamlessly on devvm:
today any OIDC login clobbers the permanent periodic token in
~/.vault-token, the drift guard only logs the drift, and his access
effectively expires weekly. Approved design: the nightly renewer re-mints
the periodic token from any admin-capable clobber (weak clobbers keep
failing loudly) and revokes stale periodic tokens after each heal.
Implementation follows on this branch.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 20:02:53 +00:00
Viktor Barzin
6698018ab6 service-catalog: add tasks row + tasks to the proxied-domains list
Some checks failed
ci/woodpecker/push/default Pipeline failed
Docs-with-change convention: the new tasks stack (Reminders-style PWA over
Nextcloud CalDAV) gets its catalog entry — what it is, its CNPG db + Vault
static role, the auth=required/X-authentik-username trust model with the
SEC-1 NetworkPolicy, and the ADR-0002 CI/CD path — and tasks joins the
Cloudflare proxied hostname list.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:42 +00:00
Viktor Barzin
02640df620 stacks/tasks: new stack for the tasks PWA (Authentik-gated, CNPG-backed)
Deploys the Reminders-style tasks app at tasks.viktorbarzin.me: namespace,
ExternalSecrets (fernet_key from secret/tasks; TASKS_DB_DSN composed from
the pg-tasks static-creds password the tripit way), single-replica
Deployment of ghcr.io/viktorbarzin/tasks:latest (image ignore_changes per
the fleet set-image pattern; Reloader restarts it on the 7-day DB password
rotation; /healthz probes on 8000; Europe/Sofia local tz; DEV_USER
deliberately absent — security invariant), Service on 8000, and an
ingress_factory host with auth=required + dns_type=proxied since Authentik
forward-auth is the app's only gate. NetworkPolicy tasks-ingress (SEC-1)
limits pod ingress to the traefik namespace plus monitoring on 8000 for
/metrics, so the trusted X-authentik-username header cannot be spoofed by
other pods.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:27 +00:00
Viktor Barzin
e0db1054e7 dbaas+vault: provision tasks CNPG database, role and rotating password
The new tasks PWA (Reminders-style front-end over Nextcloud CalDAV, per
tasks/docs/2026-07-03-tasks-pwa-design.md) needs its own Postgres database
for Connected Accounts and sync state. Follows the tripit/job_hunter
pattern exactly: idempotent null_resource creates role+db on the CNPG
primary with a placeholder password, and the Vault database engine static
role pg-tasks (added to the postgresql connection allowed_roles) rotates
the real password every 7 days, consumed by the tasks stack via a
vault-database ExternalSecret.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 19:53:13 +00:00
Viktor Barzin
9dcd3b0d5d Merge remote-tracking branch 'forgejo/master' into wizard/stem95su-cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 15:27:04 +00:00
Viktor Barzin
5367d4a055 paperless-mail-ingest: rules process inline attachments (Apple Mail lesson)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor's first real forward carried the invoice PDF with
Content-Disposition: inline (Apple Mail does this for real documents),
and the attachments-only rules consumed nothing — recorded
PROCESSED_WO_CONSUMPTION, which also blocks reprocessing. Flipped all 5
rules to attachment_type=2 (process inline) via the API and documented
the trade-off + the ProcessedMail unblock step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:25:44 +00:00
Viktor Barzin
21c6e7112e stem95su: retire the in-cluster serving stack — now a Valia site on Pages
Completes the ADR-0018 cutover. The stack is emptied to a tombstone so
CI destroys nginx, the NFS content volume, the ingress, the per-site
gdrive-sync CronJob and the namespace; serving + sync are owned by
stacks/valia-sites since the cutover commits. Catalog + runbook updated
to the migrated state (incl. the one-time 42.9→21.4MB video compression
Viktor approved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:22:32 +00:00
Viktor Barzin
974c9976e3 valia-sites: take over stem95su DNS (manage_dns=true) — cutover half 2
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Creates the public proxied CNAME stem95su -> stem95su.pages.dev and
adds the internal split-horizon entry via the valia-sites-dns
ConfigMap (the sync's update pass repoints the existing internal
record). Completes the ADR-0018 cutover; the old in-cluster serving
stack is retired in a follow-up.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
5c8e9daabd stem95su: release the public CNAME (dns_type=none) for the Pages cutover
All checks were successful
ci/woodpecker/push/default Pipeline was successful
First half of the ADR-0018 stem95su cutover: the tunnel-target CNAME is
destroyed so stacks/valia-sites can create the Pages-target record for
the same name (Cloudflare allows one CNAME per name; the follow-up
commit flips manage_dns=true there). stem_video.mp4 was compressed to
21.4MB with Viktor's explicit OK, clearing the 25MB Pages cap; content
is already deployed on the stem95su Pages project. Brief public
NXDOMAIN window between the two applies is accepted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 15:21:18 +00:00
Viktor Barzin
c1ee6863b3 mailserver docs: troubleshooting entry for the postsrsd 100%-CPU spin
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Hit during the docs@ rollout: after a pod restart postsrsd came up
spinning without binding its TCP ports, so postfix cleanup tempfailed
every message with 451 queue file write error. Document the signature
and the supervisorctl-restart / pod-recreate fix.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:39:13 +00:00
Viktor Barzin
4ee4d1927d mailserver: guard alias filter against short lines with a lazy ternary
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
CI pipeline 469 failed with 'Invalid index' on the postfix_virtual alias
filter: terraform only short-circuits &&/|| from v1.6, and the older
terraform in the infra-ci image still evaluated split(" ", line)[1] for
the blank and comment lines that have been in extra/aliases.txt since the
plans@ block. The devvm's newer terraform short-circuits, which is why the
local apply of the same commit passed. A conditional expression is lazy on
every terraform version, so move the length guard into a ternary.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:38:30 +00:00
Viktor Barzin
68b9858eff paperless-mail-ingest runbook: manual mail_fetcher must drop to the paperless user
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A root-run kubectl exec mail_fetcher downloads attachments root-owned into
the scratch dir and the celery consumer (uid 1000) fails with
PermissionError — found during the build E2E. Document s6-setuidgid usage
and the recovery step.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:26:12 +00:00
Viktor Barzin
77fcb08e8e mailserver: add docs@ paperless ingest mailbox (sieve sender allowlist)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Viktor asked to forward arbitrary emails with PDF attachments into
paperless-ngx, with the forwarding sender mapping 1:1 to the paperless
account that owns the document. paperless-ngx's built-in IMAP consumer
already does the sender->owner mapping, so the infra half is a dedicated
real mailbox docs@viktorbarzin.me: an explicit self-alias (the @domain
catch-all would otherwise divert it into the TripIt-swept spam@ mailbox,
whose sweeper LLM-parses and auto-replies to mail from linked senders)
plus a per-user Dovecot sieve that discards non-family senders at
delivery (chosen behaviour for unmatched senders: ignore and delete;
also keeps spam out of the guessable address). The mailbox credential
was added to Vault secret/platform.mailserver_accounts. Paperless-side
mail account + 5 per-sender rules are DB state, configured via the API
per the new runbook docs/runbooks/paperless-mail-ingest.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 14:06:19 +00:00
Viktor Barzin
f5187806f9 ADR-0017: replace ASCII trunk diagram with excalidraw VLAN-tagging diagram
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants the traffic-flow view as a colored excalidraw instead of
the ASCII block (which was the only thing rendering after the earlier
VLAN-tagging SVG commit failed to push: a locally-masked
non-fast-forward this session, not a merge clobber). Ships both the editable
.excalidraw scene and a hand-drawn-style SVG export embedded in the
Traffic-on-the-trunk section: two lanes showing where the 802.1Q tag
is added, carried (only P5<->vmbr0) and stripped, L2 membership drops
vs L3 firewall verdicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 13:21:59 +00:00
Viktor Barzin
316cdb7441 docs: valia-sites runbook + dns.md CM mechanism + service-catalog entries
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Runbook covers add/update/retire (one map entry; internal DNS now
cleans up after itself), content rules for Valia's folders, and the
failure modes incl. both token re-mint paths. dns.md superset-rule
paragraph now describes the declarative ConfigMap reconcile instead of
hand-added static CNAMEs. Catalog: new valia-sites row; stem95su row
notes its Pages cutover is parked on the 42.9MB stem_video.mp4
exceeding the 25MB Pages per-file cap.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:46:24 +00:00
Viktor Barzin
4a3c8287c3 Merge remote-tracking branch 'forgejo/master' into wizard/valia-sites
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:43:28 +00:00
Viktor Barzin
e0991853e4 valia-sites: 25MB Pages-limit guard; cloudflared: drop removed{} (CI TF <1.7)
Two fixes from the first live runs. (1) The sync job now skips a whole
site when any file exceeds Cloudflare Pages' 25MB per-file cap, leaving
current serving untouched — stem95su's stem_board.html references a
42.9MB stem_video.mp4, which made every run fail; the guard turns that
into a loud skip so bridge keeps syncing. (2) The CI terraform is older
than 1.7 and rejects removed{} blocks anywhere (pipelines 461/464), so
the bridge record handoff was completed with a one-time manual
'tg state rm module.cloudflared.cloudflare_record.bridge_pages' from
the main checkout; the block is deleted and the module comment records
the manual step.
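
The size guard in fix (1) can be sketched as follows. This is a hypothetical Python illustration, not the sync job's code: `plan_sync`, `oversized_files` and the input shape (site name mapped to file sizes in bytes) are assumptions, as is reading the 25MB cap as 25 MiB.

```python
# Assumed interpretation of the per-file cap: 25 MiB.
PAGES_MAX_FILE_BYTES = 25 * 1024 * 1024


def oversized_files(sizes: dict, limit: int = PAGES_MAX_FILE_BYTES) -> list:
    """Return the names of files whose size in bytes exceeds `limit`."""
    return sorted(name for name, size in sizes.items() if size > limit)


def plan_sync(sites: dict) -> dict:
    """Map each site to "sync", or to a loud skip reason if any file is too big."""
    plan = {}
    for site, sizes in sites.items():
        too_big = oversized_files(sizes)
        if too_big:
            # Skip the whole site so current serving stays untouched,
            # but keep processing the remaining sites.
            plan[site] = "skip: over per-file cap: " + ", ".join(too_big)
        else:
            plan[site] = "sync"
    return plan
```

The point of deciding per site is that one oversized file no longer fails the whole run: other sites still get a "sync" verdict.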

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:43:13 +00:00
Viktor Barzin
348f64d34d ADR-0017: add physical-cabling diagram (wires only)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for one diagram showing just the physical connections
between nodes, separate from the logical/VLAN topology: ISP->AX6000,
the in-wall apartment->garage run into P1, 4G router (cellular OOB),
UPS mgmt, the PoE cat6 to the camera, the LAN1 cable to eno1, dark
eno2 fallback + free eno3/4, iDRAC on shared-LOM, and the note that
everything else on the R730 is virtual. Referenced from the ADR next
to the logical SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:40:29 +00:00
Viktor Barzin
126cf4c88e Merge origin/master into wizard/cctv-adr-trunk
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 12:32:00 +00:00
Viktor Barzin
695e020111 cloudflared: move bridge removed{} to stack root — removed blocks are root-module-only
Some checks failed
ci/woodpecker/push/default Pipeline failed
Pipeline 461 failed terraform init: the removed{} handoff block sat in
the stack-local module, but Terraform only allows removed blocks in the
root module. Same intent, correct position (from =
module.cloudflared.cloudflare_record.bridge_pages, destroy=false).
Without this the stale state entry would make the next cloudflared
apply destroy the record valia-sites now owns.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:53 +00:00
Viktor Barzin
5d16a18cf4 ADR-0017: document trunk traffic semantics + ASCII topology
While reviewing the single-switch design Viktor asked whether both the
home LAN and the camera VLAN 'go via pfSense which forwards upstream' -
a natural misreading a future reader would repeat. Added a section
spelling out the vmbr0 fork: untagged home LAN is L2-bridged past
pfSense (gateway stays the AX6000, rack outage does not affect it, OOB
via 4G survives), while tagged-30 can only land on the dCCTV interface,
making a pfSense bypass impossible by construction. Includes a compact
ASCII topology for terminal readers alongside the SVG.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:31:48 +00:00
Viktor Barzin
8b80b4cc41 valia-sites: registry stack for Valia's Pages sites + declarative internal DNS (ADR-0018)
Some checks failed
Build valia-sites-sync / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Valia keeps asking Viktor to host 1-page sites from her Drive folders;
this makes it one map entry. New stacks/valia-sites: per site a CF Pages
project + custom domain + proxied CNAME (bridge adopted via import{}),
a ConfigMap feed (valia-sites-dns) the technitium ingress-dns-sync
script now reconciles internal CNAMEs from (add/update/REMOVE — fixes
the add-only stale-record gotcha), and one shared 10-min CronJob that
mirrors each Content folder (rclone, drive.readonly, stem95su's guards)
and wrangler-deploys ONLY on manifest change (free-tier deploy cap).
Scoped CF Pages token + shared rclone conf in secret/valia-sites; the
Global API Key never enters a pod. cloudflared forgets bridge's record
via removed{} (no destroy). stem95su is in the map dns-parked
(manage_dns=false) until its cutover commit.
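
The "deploy ONLY on manifest change" rule can be sketched as a digest comparison. This is an illustrative Python sketch under assumptions: the real job's manifest format is not shown in the commit, so `manifest_digest` and `should_deploy` are hypothetical names and the path-plus-content hashing scheme is one plausible choice.

```python
import hashlib


def manifest_digest(files: dict) -> str:
    """Digest a mirrored folder given as {relative_path: content_bytes}."""
    digest = hashlib.sha256()
    # Sort paths so the digest does not depend on listing order.
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[path]).digest())
    return digest.hexdigest()


def should_deploy(files: dict, last_deployed_digest: str) -> bool:
    """Deploy only when the current manifest differs from the last deployed one."""
    return manifest_digest(files) != last_deployed_digest
```

With a 10-minute schedule, an unchanged folder produces the same digest on every run, so no deploy is spent against the free-tier cap.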

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:28:06 +00:00
Viktor Barzin
5c42155b81 docs: Valia-sites domain language + ADR-0018 (off-infra Pages, in-cluster sync)
Grill session with Viktor: his mother Valia will keep asking for 1-page
site hosting, so the pattern is being made repeatable. Decisions: all
Valia sites serve off-infra on Cloudflare Pages (survive homelab
outages); one shared in-cluster CronJob mirrors her Drive folders every
10 min and redeploys on change; English subdomain names picked by
Viktor; failed-Job-only visibility; stem95su migrates onto the pattern.
CONTEXT.md gains Valia site / Content folder / Entry file; full
rationale and rejected options in ADR-0018.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 12:17:45 +00:00
Viktor Barzin
e1bd111562 rename CF Pages site most.viktorbarzin.me -> bridge.viktorbarzin.me
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to rename the 'мост' school static site to 'bridge'.
New Cloudflare Pages project 'bridge' (bridge-cv2.pages.dev) already
deployed and the custom domain attached; this renames the public CNAME
(TF resource most_pages -> bridge_pages, destroy+create swaps the
record) and the internal split-horizon static CNAME in the
ingress-dns-sync CronJob. The old 'most' Pages project and the stale
internal 'most' record are removed out-of-band after this applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:52:30 +00:00
Viktor Barzin
7dd80b6c7c technitium: mirror most.viktorbarzin.me into the internal zone (CF Pages site)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The internal split-horizon zone is authoritative for viktorbarzin.me,
so the new Cloudflare Pages site (most.viktorbarzin.me, added for
Viktor's 'мост' school static site) NXDOMAINed for every internal
client — LAN, VLANs and pods — while resolving fine externally.
Per the superset rule, add it as a static CNAME (-> most-6if.pages.dev)
in the ingress-dns-sync CronJob next to the mail-auth records, and
document the off-infra-site case in dns.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:10:46 +00:00
Viktor Barzin
217a54be9d cloudflared: add most.viktorbarzin.me CNAME for Cloudflare Pages site
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to host a static HTML site (the 'мост' school project,
ОбУ „Отец Паисий", pulled from his Google Drive) on Cloudflare Pages
with a custom domain, as a try-out of Pages hosting. The site content
is deployed off-infra via wrangler to the Pages project 'most'
(most-6if.pages.dev); this CNAME points most.viktorbarzin.me at it.
The custom domain is already attached to the Pages project and is
waiting on this DNS record to validate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 10:06:33 +00:00
Viktor Barzin
be80ef23bb ADR-0017 rev 3: single switch — PE replaces the SG105E, CCTV rides a VLAN-30 trunk on the LAN1 cable
Viktor prefers not running two switches, so the TL-SG105PE takes over
all rack duties (apartment uplink, 4G, UPS, camera PoE) and the CCTV
segment moves onto a managed tagged trunk over the existing LAN1 cable:
pfSense net3 re-pointed from vmbr2 to vmbr0 tag=30 (applied live; same
MAC so vtnet3/dCCTV survived untouched). This is safe where the original
802.1Q rejection was not, because the managed switch is the only device
on eno1 and polices VLAN-30 membership. eno2/vmbr2 kept dormant as the
documented fallback. Old SG105E retires to cold spare; PE inherits
192.168.1.6. Glossary Segment term updated (all three segments are now
bridge-tags feeding untagged pfSense vNICs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 09:15:52 +00:00
Viktor Barzin
4082934bc1 Merge origin/master into wizard/cctv-two-switch
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-03 08:37:34 +00:00
Viktor Barzin
e11bd6e893 ADR-0017 rev 2: two switches — the PE is a dedicated CCTV island, no VLAN table anywhere
Viktor asked to verify free ports on the garage switch (192.168.1.6)
before finalizing. Logging into it showed it is NOT the TL-SG105PE from
the plan but a pre-existing non-PoE TL-SG105E with 4 of 5 ports in use
(apartment uplink, R730 LAN1, 4G router, UPS) - the single-shared-switch
port-VLAN design written earlier today was based on conflating the two
devices. Corrected: the new TL-SG105PE carries ONLY camera + eno2
uplink (mgmt 10.0.30.6 inside the segment), the old switch is untouched,
and no VLAN config exists anywhere. ADR, topology SVG and networking.md
updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-03 08:37:15 +00:00
Viktor Barzin
08fb65827c tripit: set PLACE_PHOTO_PROVIDER=wikipedia — real place preview photos
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for place photos on the tripit Trip board. The app-side
work (add-time photo fetch, board place cards) shipped in tripit
v0.106.0, but prod never set PLACE_PHOTO_PROVIDER, so the fake provider
would store placeholder PNGs for every hand-added place. Same class of
fake-default gap as PLACE_RESOLVER_MODE (set explicitly for the same
reason); the ADR-0035 rollout had left both the env flip and its
backfill cron undone.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 21:57:21 +00:00
Viktor Barzin
b761701994 ADR-0017: add network topology diagram (SVG) next to the decision
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked for a reviewable network visualization committed alongside
the CCTV-segment ADR. Hand-drawn SVG (renders on Forgejo, validated
palette): physical path camera -> TL-SG105PE port-VLANs -> eno2/vmbr2 ->
pfSense dCCTV, the firewall flows (Frigate RTSP, ha-sofia ISAPI/RTSP,
NTP-only egress, default deny), and the dashed camera-day steps (patch
cable, cat6 run, AX6000 static route).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:25:28 +00:00
Viktor Barzin
248e186dce CCTV segment (dCCTV 10.0.30.0/24) on a dedicated pfSense leg for the garage camera
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor and emo are adding the first owned camera at the Sofia site (HiLook
IPC-T241H-C watching the garage / server rack). Viktor asked to finalize
emo's plan; the grilling session resolved emo's five open decisions and
replaced the doc's 802.1Q-trunk idea with the site idiom: a dedicated
physical leg (R730 eno2 -> vmbr2 -> pfSense net3 = dCCTV 10.0.30.1/24),
port-based VLAN split on the shared TL-SG105PE, camera default-deny with
NTP-only egress, Frigate + ha-sofia as the only consumers.

The PVE bridge, pfSense interface, Kea subnet and firewall rules were
applied live this session (hand-managed hosts, backed up). This commit
records the decision (ADR-0017), the glossary terms (Segment / CCTV
segment), the as-built architecture doc, and bumps Frigate's ADR-0016
VRAM budget 2000 -> 2300 MiB for the upcoming NVDEC stream.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 20:01:45 +00:00
3a5194c9d4 Merge pull request 'immich(frame-emo): show photos from the last 365 days (was 730)' (#18) from emo/frame-emo-1year into master
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reviewed-on: #18
2026-07-02 19:05:31 +00:00
9e253d409a immich(frame-emo): show photos from the last 365 days (was 730)
Emil asked his Sofia Portal Mini photo-frame to show only the past
year of photos rolling from today, instead of the last two years.
Changes ImagesFromDays 730 -> 365 in the frame-emo Settings.yml.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 19:05:31 +00:00
Viktor Barzin
4c532dbf97 devvm containment: drop the MemoryHigh throttle band, straight to MemoryMax OOM
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
t3.viktorbarzin.me went down 2026-07-02 15:42-16:35 UTC: an agent-spawned
12.3G ugrep plateaued inside t3-serve@wizard's MemoryHigh(12G)..MemoryMax(16G)
band. With MemorySwapMax=0 its anon pages were unreclaimable, so the kernel
throttled every task in the cgroup indefinitely (memory.pressure full ~80%,
oom_kill never fired) - the t3 event loop starved, the accept queue rotted,
and the terminal was dead until the hog was SIGKILLed by hand.

The 2026-06-22 design assumed 'throttle to a crawl, then OOM locally'; a hog
that stabilises between high and max never OOMs, so the throttle band is a
livelock zone, not a safety layer. Viktor asked to close that gap: MemoryHigh
is now explicitly infinity on all three work cgroup definitions (t3-serve@
unit, user-<uid>.slice drop-in, docker.slice) so a runaway is
cgroup-OOM-killed at MemoryMax immediately - OOMPolicy=continue already keeps the t3
server alive when a child dies. MemoryMax/MemorySwapMax=0/earlyoom unchanged.
Applied live to the devvm the same day (daemon-reload + runtime set-property
on running cgroups, no session restarts). Post-mortem addendum + runbook
updated in the same commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 16:59:38 +00:00
Viktor Barzin
684ca4527c docs(CLAUDE.md): T4 now has a VRAM budget + watchdog (ADR-0016, dry-run); note llama-swap budget miscalibration
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Session wrap-up doc sync: the Immich note still claimed the shared T4 had no
VRAM isolation. Record the gpumem budget/watchdog shipped earlier today, that
the watchdog is observe-only, and that budgets need a retune (llama-swap's
real 16k-ctx resident is ~7GB, not 4.35) before arming.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 15:20:06 +00:00
Viktor Barzin
21afae85c9 dawarich: dedicated 100/1000 Traefik rate limit (default 10/50 429'd page loads)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor saw dawarich throwing 429s through Traefik and asked to loosen
the burst for it. The access log confirms the burst pattern: one page
load fires the whole fingerprinted-asset tail (SVG store badges,
favicons, webmanifest) from a single client IP and trips the default
10 req/s / burst 50 limiter (repro: 80 parallel GETs -> 28x 429).
Same remedy as ha-sofia, ActualBudget, noVNC, tripit, health and
authentik: dedicated dawarich-rate-limit middleware (average 100 /
burst 1000) + skip_default_rate_limit on the dawarich ingress. Also
updates the networking.md middleware enumerations (adding the
previously undocumented tripit/health limiters alongside dawarich).
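
Why a single page load trips the default limiter can be seen with a small token-bucket model. This is a simplified Python sketch, assuming the limiter behaves like a plain token bucket with `average` as the refill rate per second and `burst` as the capacity; the class and function names are hypothetical and this is not Traefik's implementation.

```python
class TokenBucket:
    """Minimal token bucket: refills at `average` tokens/s up to `burst`."""

    def __init__(self, average: float, burst: int) -> None:
        self.rate = average
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def rejected(average: float, burst: int, arrivals: list) -> int:
    """Count requests that would be answered with 429 for the given arrival times."""
    bucket = TokenBucket(average, burst)
    return sum(not bucket.allow(t) for t in arrivals)
```

In this model, 80 requests arriving at the same instant against 10/50 yield 30 rejections, while 100/1000 rejects none. The observed 28 rather than 30 would be consistent with a couple of tokens refilling while the burst was in flight, though that is an inference from the model, not something the access log states.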

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 15:03:08 +00:00
Viktor Barzin
91d0213d1a Merge remote-tracking branch 'forgejo/master' into wizard/excalidraw-export-rename
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build excalidraw-library / build (push) Has been cancelled
2026-07-02 14:29:34 +00:00
Viktor Barzin
8fc657f431 excalidraw: migrate image build to GHA -> private ghcr (ADR-0002)
The image was still built by hand and pushed to DockerHub (v1..v4),
predating the all-builds-off-infra doctrine; Viktor chose to move it
onto the standard pipeline while shipping the export/rename feature
rather than keep the manual flow.

Mirrors the k8s-portal pattern: .github/workflows/build-excalidraw.yml
(go test + buildx linux/amd64, pushes ghcr latest+sha), excalidraw ns
added to the Kyverno ghcr-credentials allowlist (package is PRIVATE),
deployment now pins ghcr :latest with pullPolicy Always + pull secret,
Keel force/match-tag/5m annotations seed the metadata (live values win
via ignore_changes). DockerHub viktorbarzin/excalidraw-library:v4 stays
frozen as the rollback image. Docs: ci-cd.md + .claude/CLAUDE.md image
lists updated (also backfilled the missing k8s-portal rows in ci-cd.md).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:23 +00:00
Viktor Barzin
1cbc1e962b excalidraw: native export menu + drawing rename
Users couldn't see Excalidraw's built-in Save as / Export image options:
the app's custom toolbar was drawn exactly on top of the native hamburger
menu button, hiding it. Removed the overlay and integrated Back to
Library / Save now / Rename into the native menu, so the native export
formats (.excalidraw file, PNG, SVG, clipboard) are now reachable.
Viktor asked for exports to work via the native Excalidraw feature and
for drawings to be renameable by clicking their name.

Rename: new PATCH /api/drawings/{id} endpoint (server-side name
sanitization, 409 on conflict) + click-to-rename title pill in the
editor (updates URL in place) + Rename button/modal in the dashboard.
Existing GET/PUT/DELETE semantics unchanged for API compatibility
(emo's upload pipeline). Added main_test.go (httptest) covering rename
+ existing handler behavior; dashboard rows now DOM-built (XSS-safe).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:29:10 +00:00
Viktor Barzin
d94f267c93 immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes,
migration guide and release discussion #29439 reviewed — no config-breaking
changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD
vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over
HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement).

The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream
v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2';
Immich upgrades the extension itself at startup). Both photo frames switch
to ImmichFrame's immich_v3 compatibility tag because every versioned
ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API
responses; repin to a versioned tag once upstream ships stable v3 support.

Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so
this commit is the source-of-truth record; the live rollout happens via
kubectl set image in the same session. Pre-upgrade pg_dumpall taken
(job postgresql-backup-pre-v3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:18:22 +00:00
Viktor Barzin
6f03ccd1aa excalidraw: grant emo-browser SA port-forward for drawing uploads
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to fix emo's permission so his Claude can upload to the
Excalidraw service. emo's recent sessions show the documented upload
recipe (kubectl port-forward svc/draw + X-Authentik-Username header,
from his ~/.claude/CLAUDE.md) failing with:

  pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser
  in namespace excalidraw

because his default kubeconfig is the read-only emo-browser SA (its
port-forward grant covers only chrome-service) and his old admin
kubeconfig at /home/emo/code/config expired and was removed.

Add a namespace-scoped Role (pods/portforward create) + RoleBinding for
that SA in the excalidraw namespace, mirroring the 2026-06-28
chrome-service grant. Trade-off (any-user drawings via the trusted
username header) documented in the file and accepted.

Also record the grant in docs/architecture/chrome-service.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 11:08:28 +00:00
Viktor Barzin
88c86e2109 ci: Slack-notify failed pipeline runs only
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor doesn't want a Slack message for every CI run — only failures.
The infra apply pipeline posted a status line to #general on every push,
and the renew-tls / postmortem-todos / registry-config-sync /
pve-nfs-exports-sync crons posted on every scheduled run (~30+ routine
messages a week). Now: the apply pipeline's success post is gone
(notify-failure already covers failures), all cron notifies are
status:[failure] with explicit FAILED texts, and drift-detection is
silent when all stacks are clean (still posts drift findings and errors,
and gains a hard-failure catch step it previously lacked). Kept:
notify-nonadmin-push (org audit feed) and the actionable provision-user
post. Per-app deploy template in ci-cd.md updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:27:43 +00:00
Viktor Barzin
a64d2ba2b9 upgrades: fix hourly gotenberg error + cap update notifications at weekly
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor was getting upgrade-error Slack messages every hour and wants
update notifications at most weekly. Root cause of the errors: Keel kept
trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's
require-trusted-registries denied it — gotenberg/* (and apache/*, which
tika will hit next) were never allowlisted, and Keel's Slack notifier at
info level re-posted the identical failure to #general on every hourly
poll since Jun 28.

Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly;
disable Keel's direct Slack notifier and replace failure visibility with
a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification
plus the daily digest, never an hourly drip); remove diun's Slack
notifier whose default message @channel-pinged #image-updates for every
new upstream tag every 6h (the n8n upgrade-agent webhook feed is
untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC).
Paperless-ngx itself stays paused (keel policy=never, user-managed) while
the ingest runs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:16:50 +00:00
Viktor Barzin
5d5d9752cb guard: ignore + git-crypt kubeconfig files so they can't leak to the public mirror
All checks were successful
ci/woodpecker/push/default Pipeline was successful
A GitGuardian audit of the infra repo showed the recent alerts were test
fixtures (false positives), but surfaced a real historical leak: a
cluster-admin kubeconfig was once committed as stacks/f1-stream/.../.config
(now expired, reachable only via a GitHub PR ref). The .gitignore already had
a `config` rule for kubeconfigs but missed the dotfile form `.config` — which
is exactly how that file slipped onto the public mirror.

Close the gap in two layers:
- .gitignore: also ignore `.config`, `kubeconfig`, `*.kubeconfig`,
  `admin.conf`, `.kube/` so they're never staged by accident.
- .gitattributes: route `.config`, `kubeconfig`, `*.kubeconfig`, `admin.conf`
  through git-crypt so a force-add or rename still lands as ciphertext (never
  plaintext) on the public GitHub mirror.

No tracked files match these names today, so there is zero retroactive impact
— purely forward-looking prevention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 07:14:58 +00:00
Viktor Barzin
dab307f9f8 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-07-02 05:39:15 +00:00