Reflects the write-reduction params applied in c3553731, and documents the
null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't
hit the inert-change / mysql-VCT-drift traps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Cutting retention
to 3 days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.
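The snapshot arithmetic above can be checked with a short sketch. It assumes
exactly one thin snapshot per PVC per retained day, which is how the 462 figure
decomposes; the function name is illustrative, not part of the host scripts.

```python
PVCS = 66  # PVC count quoted in the commit message


def snapshot_count(pvcs: int, retention_days: int) -> int:
    """One thin snapshot per PVC per retained day (assumed model)."""
    return pvcs * retention_days


before = snapshot_count(PVCS, 7)  # 462
after = snapshot_count(PVCS, 3)   # 198
print(before, after, round(after / before, 2))  # 462 198 0.43
```

Under this model the pool keeps about 43% of the previous snapshot count, which
is consistent with "roughly halves".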
Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.
Refs: code-oflt.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.
Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.
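A minimal sketch of why the two N-1 models can disagree, assuming an aggregate
capacity check (no per-pod bin-packing) and made-up node sizes. The `Node` type,
the field names, and all numbers are hypothetical, not taken from the Workflow script.

```python
from dataclasses import dataclass


@dataclass
class Node:
    allocatable_gib: float
    requested_gib: float  # sum of pod memory requests (scheduling reservation)
    used_gib: float       # actual working set (physical usage)


def survives_n_minus_1(nodes, field):
    """True if the cluster-wide total of `field` fits on the remaining
    nodes' allocatable memory after losing any single node."""
    total = sum(getattr(n, field) for n in nodes)
    return all(
        total <= sum(n.allocatable_gib for n in nodes if n is not lost)
        for lost in nodes
    )


# Hypothetical 3-node cluster, 16 GiB allocatable each.
peak_sized = [Node(16, 12, 6) for _ in range(3)]   # requests sized to peak
right_sized = [Node(16, 10, 6) for _ in range(3)]  # requests sized separately

print(survives_n_minus_1(peak_sized, "used_gib"))        # True  (physical view)
print(survives_n_minus_1(peak_sized, "requested_gib"))   # False (false shortfall)
print(survives_n_minus_1(right_sized, "requested_gib"))  # True
```

The same physical usage passes or fails N-1 depending only on how requests were
sized, which is the conflation described above.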
Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.
Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.
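The "~1/3 as often" figure follows directly from the timer interval, assuming
checkpoints stay purely timer-driven (checkpoints_req remains negligible and
max_wal_size is never hit). A sketch of the arithmetic; the function name is
illustrative.

```python
def timed_checkpoints_per_day(checkpoint_timeout_min: int) -> int:
    """Timer-driven checkpoints per day for a given checkpoint_timeout."""
    return (24 * 60) // checkpoint_timeout_min


old = timed_checkpoints_per_day(5)   # 288
new = timed_checkpoints_per_day(15)  # 96
print(old, new, new / old)
```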
Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).
Refs: code-oflt (etcd/sdc IO isolation).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.
- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
cloudflared replica metric is blind to tunnel-connection loss. Threshold
calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.
All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified
anonymously pullable). Point global.image at it (repository + tag pinned
explicitly so neither helm's appVersion default nor Keel can downgrade it — the
2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so
Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the
Dockerfile FROM + this tag together on each authentik version bump.
Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so
the cold login-flow first-load stops being slow. (Does NOT affect old-browser
clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable
server-side.) Docs: .claude/CLAUDE.md Authentik row.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog
added in 8d1d2fb9 was treating a symptom). The tigera operator's own `whisker`
NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the
kube-dns *pods* (podSelector k8s-app=kube-dns). But whisker-backend resolves
goldmane.calico-system.svc via the kube-dns *ClusterIP* (10.96.0.10), and Calico
drops UDP DNS to a ClusterIP under a podSelector-only egress rule.
Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100%
timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy
resolves fine; a test pod with the operator's podSelector-only egress rule
reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to
100% ok. whisker-backend resolves goldmane once in the brief startup window
before the policy programs, holds its long-lived gRPC stream, and only
re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked
ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable
aggregator (separate pod, unrestricted namespace) was never affected.
Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip
(whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so
the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop
(repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns
now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace
list. Docs (runbook + CLAUDE.md) updated to the real root cause.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The login screen would sometimes hang/blank for everyone for ~30s at a time.
Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a
transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3
goauthentik-server pods dropped out of the Service at once, so Traefik had no
healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the
rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` —
so live ran the chart-default 25%/25% and dropped a pod out of rotation on every
roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on
PostgreSQL and request-serving is coupled to PG — verified there is no
external-cache option to put back, so a SHORT transient is now survived but a
total CNPG outage still takes authentik down.)
Reliability package (R2, approved):
- readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover
reconnect without dropping the whole fleet from the Service.
- rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key)
and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready.
- gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9
workers' recycles don't cluster on a DB blip.
- / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000)
from the previous commit (skip_default_rate_limit) — fixes the cold-load 429
blank screen.
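The probe tolerances quoted above (~30s before, ~80s after) are consistent with
a 10-second probe period. That period is an assumption inferred from those two
figures, not a value stated in this commit. A sketch:

```python
def readiness_tolerance_s(failure_threshold: int, period_seconds: int = 10) -> int:
    """Approximate time a pod can fail readiness before leaving the Service.

    period_seconds=10 is an assumed probe period, inferred from the
    ~30s / ~80s figures in the commit message.
    """
    return failure_threshold * period_seconds


print(readiness_tolerance_s(3))  # 30
print(readiness_tolerance_s(8))  # 80
```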
Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200,
so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the
pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md
(also corrected a stale "60s persistent DB connections" note).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unauthenticated users were getting a blank login screen (and the screen would
sometimes just hang). Root-caused via a read-only fan-out + adversarial verify:
the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through
the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and
a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was
the only first-party SPA still on the default limiter (8 siblings already have a
carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket).
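A simplified token-bucket model of the failure, assuming "10/50" means average
10 req/s with burst 50 and that a cold load issues its ~70 chunk requests
effectively at once. This is a sketch of the general technique, not Traefik's
actual limiter code.

```python
class TokenBucket:
    """Minimal token bucket: refills at `average` tokens/s up to `burst`."""

    def __init__(self, average: float, burst: int):
        self.average = average
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.average)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the limiter would answer 429 here


def rejected_on_cold_load(average: float, burst: int, chunks: int = 70) -> int:
    """Count 429s when `chunks` requests arrive at the same instant."""
    bucket = TokenBucket(average, burst)
    return sum(not bucket.allow(0.0) for _ in range(chunks))


print(rejected_on_cold_load(10, 50))     # 20
print(rejected_on_cold_load(100, 1000))  # 0
```

In this model the shared limiter rejects the last 20 chunks of a cold load,
while the 100/1000 carve-out admits all of them.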
- traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000,
mirroring the existing health/tripit carve-outs). The authentik / and /static
ingresses switch to it in the authentik-stack commit.
- monitoring: the `traefik` scrape job's drop-regex was a blanket
`traefik_router_.*`, which also dropped `traefik_router_requests_total` — so
per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable.
Narrowed it to keep the counter while still dropping the high-cardinality
`*_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh`
for the episodic all-3-server-pods-NotReady 502/503/504 cascade.
Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Whisker showed an empty UI on 2026-06-28. Root cause: whisker-backend dials
goldmane:7443 over a long-lived gRPC stream; when that stream dropped during a
transient CNI/DNS blip (right after k8s-node5 finished its v1.35.6 upgrade, its
pod resolver briefly timed out on the kube-dns ClusterIP) the Go gRPC resolver
got WEDGED — spamming "failed to stream flows" / "code = Unavailable: dns ...
i/o timeout" forever, never reconnecting. The operator ships whisker-backend
with NO liveness probe, so nothing restarted it; the live UI stayed blank until
a manual `kubectl delete pod`. (The durable aggregator is a separate pod and
was unaffected — only Whisker's ~60-min live view went dark.)
Whisker is operator-managed (Whisker CR), so we can't inject a liveness probe.
Instead add a watchdog so this never needs a manual restart again:
- whisker-watchdog CronJob (every 10 min) + least-privilege SA/Role/RoleBinding
(calico-system only: pods get/list/delete, pods/log get).
- It restarts the whisker pod only when whisker-backend logs >=10
goldmane-connection errors in 11m AND Goldmane is Ready (the Goldmane-Ready guard
avoids restart-thrash during a real Goldmane outage).
- Self-tested: a manual run reports "whisker-backend healthy: 0 ... errors"
and does not restart.
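The watchdog's decision rule can be sketched as a pure function. The threshold
comes from the bullet above; the function and parameter names are hypothetical,
and the real CronJob derives its inputs from pod logs and pod status.

```python
ERROR_THRESHOLD = 10  # goldmane-connection errors within the 11m log window


def should_restart_whisker(error_count: int, goldmane_ready: bool) -> bool:
    """Restart only when whisker-backend looks wedged AND Goldmane is up.

    If Goldmane itself is down, restarting whisker cannot help and would
    only thrash, so the guard suppresses the restart.
    """
    return error_count >= ERROR_THRESHOLD and goldmane_ready


print(should_restart_whisker(0, True))    # False (healthy)
print(should_restart_whisker(42, True))   # True  (wedged resolver)
print(should_restart_whisker(42, False))  # False (real Goldmane outage)
```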
Docs: runbook gains a "Whisker UI empty" troubleshooting entry + a self-heal
note; the stale 2026-06-25 "digest never posted" known-state block is updated
to Resolved (digest posts to #alerts, lastSuccessfulTime current); CLAUDE.md
flow-trail bullet gains the whisker-wedge gotcha.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to change the infra apply guidance: instead of 'never apply
locally, always rely on CI', the policy is now 'you MAY apply locally, but
always commit the change to the infra repo'.
- .claude/CLAUDE.md (Critical Rule: Terraform Only): new bullet making local
apply explicit (scripts/tg apply / homelab tf apply) from the MAIN checkout
(not a worktree — git-crypt'd tfvars read as ciphertext there), with a hard
requirement that every applied change is committed + pushed to master the same
session so the repo stays the source of truth and CI drift-detection doesn't
revert it. Spells out the apply<->commit ordering both ways.
- AGENTS.md (non-admin workstation land steps): step 5 now notes local apply as
an option alongside CI auto-apply, with the same 'always committed, never
applied uncommitted' rule.
Note: the org-managed settings block also frames CI auto-apply but is not
editable from a workstation clone.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The dedicated #security Slack channel was unreachable: the shared incoming
webhook (Vault secret/viktor -> alertmanager_slack_api_url) belongs to a
Slack app that isn't a member of #security, so any channel override on it
returns HTTP 404 channel_not_found. The goldmane-edges-digest was silently
failing for that reason.
Per request ("dump the security channel, post in an existing one"), route
everything to #alerts instead:
- alertmanager slack-security receiver -> #alerts (keeps its [SECURITY/<sev>]
title styling so security-lane alerts still stand out in the shared channel)
- goldmane-edges-digest CronJob SLACK_CHANNEL -> #alerts (comment only; the value
was already switched and applied in the previous change)
- AggregatorDown / DigestFailing alert summaries reworded to say #alerts
- docs swept (security.md, monitoring.md, ADR-0014, goldmane runbook,
.claude/CLAUDE.md, service-catalog, CONTEXT.md) to drop the
"invite the app / flip back to #security" caveats and state the
#security abandonment + #alerts consolidation as the current routing.
Monitoring stack applied (alertmanager rolled, live config verified:
slack-security channel is now #alerts).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the Goldmane who-talks-to-whom trail (ADR-0014), implemented by a
subagent workflow (distinct stacks in parallel, docs last):
- #57 Whisker gated ingress: ingress_factory (whisker.viktorbarzin.me,
auth=required, Authentik-gated) + a NetworkPolicy allowing traefik->whisker:8081
(the operator's whisker NP default-denies ingress). calico stack.
- #61 pipeline health: AggregatorDown + DigestFailing Prometheus alerts
(prometheus_chart_values.tpl) + cluster-health check #48.
- #59 service-identity labels on the multi-Service namespaces (monitoring's 5
TF-managed deployments + dbaas), with the KYVERNO_LIFECYCLE_V1 marker so they
update in-place.
- #62/#63 docs: docs/runbooks/goldmane-flow-trail.md (new), service-catalog,
security.md + monitoring.md east-west sections, ADR-0014 as-built, CONTEXT.md.
#62 = the SQL to derive the Wave-1 per-namespace egress allowlist from the
edge table (feeds code-8ywc; enforce-flips out of scope).
Also fixes the digest's Slack target: #security override 404s channel_not_found
because the shared alertmanager_slack_api_url webhook's app isn't a member of
#security (this likely also breaks alertmanager's slack-security receiver — flagged
in the runbook). Routed to #alerts (the webhook's working channel) until the app
is invited; verified a real digest run posts cleanly (360 edges).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to redesign the ha-london dashboards and fix the broken integrations (the Cowboy one). The skill's ha-london knowledge map had drifted badly from reality, so this brings it current: it claimed HA 2025.9.1 on a docker-run container (it's HAOS 2026.5.2, managed); listed the now-dead jdejaegh Cowboy integration with sensor.bike_* entities (revived via elsbrock/cowboy-ha v1.2.0 -> entities are sensor.classic_performance_*); and didn't flag that met/metoffice/roomba/hildebrandglow are user-disabled (not broken) or that Tapo P100 is failing. Also documents the redesigned Overview (Home+More sections, Mushroom+mini-graph-card), the dashboard/view/card glossary, the parked-bike 'unknown battery' gotcha, and that london is API-only from the Sofia devvm (no SSH).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 3 of the ESO 0.12->2.6 migration (the last k8s-1.35 compat-gate blocker).
Climbed external-secrets 0.16.2 -> 0.17.0 -> ... -> 2.6.0 one minor at a time,
each hop applied + verified (ES sync held at 109 Ready every hop; atomic=true
rollback safety net). Crossed the 0.17 cutoff (v1beta1 serving removed) only
after Phase 2 put all 104 ExternalSecrets + 2 ClusterSecretStores on
external-secrets.io/v1. Result: compat-gate now returns "OK: cluster is safe to
upgrade to 1.35.6" (EXIT 0) — the autonomous version-check chain will take k8s
1.34 -> 1.35 on its next nightly run.
Also fixes the repo-wide stale-lock issue that broke CI pipeline 332: the
terragrunt-generated providers.tf declares gavinbunney/kubectl + telmate/proxmox,
but ~28-39 stacks' committed .terraform.lock.hcl predated that ("Inconsistent
dependency lock file: no version selected"). Reconciled via `tg init -upgrade`
and committed so `terragrunt apply`/CI work cleanly again.
Docs: .claude/CLAUDE.md ESO line corrected (104 ESs, v1, chart 2.6.0); plan doc
marked COMPLETE.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The claude-memory MCP/plugin was uninstalled 2026-06-21 (recall now via the
homelab-memory-recall.py UserPromptSubmit hook; store/recall/update via the
`homelab memory` CLI, which hits the same remote HTTP API). Updates the
.claude/CLAUDE.md 'remember X' instruction off the obsolete local memory-tool
CLI + memory_search/memory_get onto the homelab CLI. Matches the root monorepo
CLAUDE.md + ~/.claude/rules/execution.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik Yaegi CrowdSec bouncer plugin was dead on Traefik 3.7.5 (handler
never invoked) and has been removed. Document the replacement: in-kernel
nftables drop via cs-firewall-bouncer on direct hosts, and a Cloudflare IP-List
+ zone WAF block rule (fed by a LAPI->CF-list sync CronJob) on proxied hosts.
Both add zero per-request latency and fail open.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.
This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
allowlist comment claimed council-complaints as the last referencer;
rewrite it (no live workload pulls from that registry now; only stale
completed Job records still carry the ref). The allowlist line itself
is kept (registry-scoped, not app-specific).
Historical point-in-time plan docs
(docs/plans/2026-05-16-auto-upgrade-apps-{design,plan}.md) still mention it
inside a frozen "10 GHA-migrated repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The infra CI pipeline was failing often — ~38% of the last 50 runs didn't
succeed. The single biggest cause (8 of 19 non-successes) was Tier-1 stack
applies dying instantly with "Error acquiring the state lock".
Tier-0 stacks already degrade gracefully (Vault advisory lock → the pipeline
skips a locked stack). Tier-1 stacks have no such fallback: they rely on
terraform's pg-backend pg_advisory_lock, and scripts/tg ran terragrunt with
no -lock-timeout, so any concurrent lock holder was fatal — a Woodpecker-killed
run whose PG lock wasn't reaped yet (PL266 killed → PL267 failed the same
second), a human/agent applying locally, or the daily drift `plan`.
Fix: scripts/tg now passes -lock-timeout (default 5m, override TG_LOCK_TIMEOUT)
on every state-locking verb (plan/apply/destroy/refresh), so a contended lock
WAITS for the holder to finish instead of failing. -auto-approve behaviour for
non-interactive applies is unchanged. Central wrapper change → covers CI, plus
local human/agent applies; no CI image rebuild (tg is read from the repo).
Adds a hermetic pytest (stub terragrunt + preset PG_CONN_STR) pinning the
arg-injection. Docs updated in AGENTS.md + .claude/CLAUDE.md.
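The arg-injection can be illustrated in Python (scripts/tg itself is not shown
here, so this is a sketch of the behaviour described above, not its code). The
function name and the "skip if already present" guard are assumptions.

```python
import os

LOCKING_VERBS = {"plan", "apply", "destroy", "refresh"}


def inject_lock_timeout(args, env=None):
    """Return args with -lock-timeout added for state-locking verbs.

    The default of 5m and the TG_LOCK_TIMEOUT override come from the commit
    message; leaving an explicit -lock-timeout untouched is an assumption.
    """
    env = os.environ if env is None else env
    args = list(args)
    if not args or args[0] not in LOCKING_VERBS:
        return args
    if any(a.startswith("-lock-timeout") for a in args):
        return args
    timeout = env.get("TG_LOCK_TIMEOUT", "5m")
    return [args[0], f"-lock-timeout={timeout}", *args[1:]]


print(inject_lock_timeout(["apply", "-auto-approve"], env={}))
print(inject_lock_timeout(["output"], env={}))
```

Verbs that do not take the state lock, such as `output`, pass through unchanged.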
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Mined another devvm user's Claude sessions for repeated, hand-rolled command
patterns worth absorbing into the shared CLI. The dominant signal was Home
Assistant "Sofia" work: a `kubectl | base64 | jq` token-extraction pipeline
re-derived ~420x, and a bespoke non-interactive `ssh -o …` invocation reinvented
~30x — every session. The existing `home-assistant-sofia.py` already covers the
API but goes unused from an arbitrary cwd (needs an env var set + a cwd-relative
path), so agents bypassed it and hand-rolled everything.
Add two verbs covering exactly the gaps the `ha` MCP can't (entity state/control
stays with the MCP):
- `ha token [--instance sofia|london]` (read): resolves the long-lived API token
live from k8s secret openclaw/openclaw-secrets via the ambient kubeconfig — no
pre-set env var. Composes as `curl -H "Authorization: Bearer $(homelab ha token)"`.
- `ha ssh [--instance sofia|london] -- <cmd>` (write): deterministic
non-interactive ssh to the HA host using the invoking user's key.
Also fix the root cause: `home-assistant-sofia.py` now falls back to
`homelab ha token` when its env var is unset (works from any directory), and the
home-assistant skill points agents at these verbs + `homelab metrics query`
instead of hand-rolled curls. README + ADR-0012 + AGENTS.md updated per the
per-verb-group convention.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
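A defensive pattern for test cleanup of this kind is to require an exact,
unique username match before deleting anything, instead of taking users[0] from
a fuzzy search. A sketch, assuming the search results are a list of dicts with
a `username` field; the usernames below are placeholders.

```python
def pick_exact_user(results, username):
    """Return the single user whose username matches exactly, or raise.

    Fuzzy ?search= results may contain unrelated accounts, so anything other
    than exactly one exact match is treated as an error.
    """
    matches = [u for u in results if u.get("username") == username]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one user named {username!r}, got {len(matches)}"
        )
    return matches[0]


# Placeholder data: a fuzzy search that ranks the real account first.
search_results = [
    {"pk": 1, "username": "real-account"},
    {"pk": 2, "username": "demo-user"},
]
print(pick_exact_user(search_results, "demo-user")["pk"])  # 2
```

With this guard, a search that returns only the real account raises instead of
silently selecting it for deletion.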
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.
- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
--provider github` (name "github", matching the callback registered on
the GitHub OAuth App). Like the existing Authentik source, it lives in
Forgejo's DB rather than Terraform — there's no clean TF resource for
login sources. Client id/secret mirrored to Vault secret/viktor
(forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
[oauth2_client], so a first GitHub sign-in creates the account directly
("sign up with GitHub") instead of a link-to-existing detour. The
GitHub identity is the trust gate for this path; Turnstile + email
confirmation still gate the native form.
Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.
Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:
- Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
is managed in Terraform (turnstile.tf) via the CF Global API key, so
the sitekey/secret are IaC, not a dashboard artifact.
- Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
(mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
credential Authentik uses (email-secret.tf ESO -> secret/authentik
smtp_password).
Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.
Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.
Runbook: docs/runbooks/forgejo-open-signups.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the new paperless-ai service and the two non-obvious operational
facts: runtime config lives in the PVC .env (not TF env, which would shadow
it), and Qwen3 needs /no_think for parseable tagging output.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Disruptive node drains should run when the cluster is idle. Move the
k8s-version-check detection CronJob from 12:00 UTC (noon) to 23:00 UTC
(00:00 London) — overnight, low usage, and clear of the kured OS-reboot window
(01:00-05:00 UTC) so the two drain pipelines never overlap. (Viktor, 2026-06-17.)
- stacks/k8s-version-upgrade/main.tf: var.schedule default 0 12 * * * → 0 23 * * *.
- scripts/upgrade_state.sh: next_scheduled_run_utc now computes the 23:00 slot
(was next_daily_noon_utc).
- docs (runbook, architecture) + upgrade-state SKILL: schedule references
updated to 23:00 UTC nightly.
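The next-slot computation can be sketched in Python (the real helper lives in
the shell script scripts/upgrade_state.sh, which is not reproduced here).
Treating a call made exactly at 23:00:00 as "next run is tomorrow" is an
assumption.

```python
from datetime import datetime, timedelta, timezone


def next_scheduled_run_utc(now: datetime) -> datetime:
    """Next 23:00 UTC slot strictly after `now` (timezone-aware, UTC)."""
    slot = now.replace(hour=23, minute=0, second=0, microsecond=0)
    if slot <= now:
        slot += timedelta(days=1)
    return slot


noon = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
late = datetime(2026, 6, 17, 23, 30, tzinfo=timezone.utc)
print(next_scheduled_run_utc(noon).isoformat())  # 2026-06-17T23:00:00+00:00
print(next_scheduled_run_utc(late).isoformat())  # 2026-06-18T23:00:00+00:00
```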
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
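The difference between the bare failed-pod count and the terminal-reason check
can be sketched as follows. The input shapes (a failed-pod count and a
reason -> value mapping) are an assumed simplification of what
kube-state-metrics exposes, and the function names are hypothetical.

```python
TERMINAL_REASONS = ("BackoffLimitExceeded", "DeadlineExceeded")


def fires_on_bare_count(failed_pods: int) -> bool:
    """Old behaviour: any failed pod fires, even if a retry succeeded."""
    return failed_pods > 0


def fires_on_terminal_reason(failed_by_reason: dict) -> bool:
    """New behaviour: fire only when the Job itself failed terminally."""
    return any(failed_by_reason.get(r, 0) > 0 for r in TERMINAL_REASONS)


# A phase whose first pod failed but whose retry succeeded (Job Complete).
print(fires_on_bare_count(1))                                   # True (false positive)
print(fires_on_terminal_reason({}))                             # False
# The stuck preflight Job.
print(fires_on_terminal_reason({"BackoffLimitExceeded": 1}))    # True
```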
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 4 docs for the enforcer -> gated-tracker change:
- runbook t3-version-bump.md: rewritten around the tracker — how each bump is
gated, plus freeze/revert/pin/dry-run/manual-rollback ops.
- post-mortem 2026-06-09: append the deliberate 2026-06-16 reversal and how the
gates close each named root-cause/lesson (historical sections left intact).
- service-catalog t3 row: "PINNED 0.0.24 enforcer" -> gated nightly tracker;
replace the stale "auto-pair 401-broken on 0.0.26" note (re-verified healthy
2026-06-16, cookieless -> 302 + t3_session).
- t3-provision-users.sh step 5b comment: enforcer -> tracker; note Persistent dropped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Minted a dedicated classic GitHub PAT scoped to read:packages and stored it in
Vault secret/viktor/ghcr_pull_token (2026-06-15), replacing the previous alias
of the broad admin github_pat. Propagated via targeted apply of
module.kyverno.kubernetes_secret.ghcr_credentials (Kyverno re-syncs the
allowlisted namespaces). Document the new cred + the manual rotation recipe.
Closes: code-h2il
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deploy a small stateless anisette-data server so the TripIt iOS Shell can be
sideloaded with SideStore using a free Apple ID, without brokering the
Apple-ID auth dance through a public third-party anisette server (which would
see every login). SideStore points at a stable internal endpoint we control.
- Image: Dadoum/anisette-v3-server, the de-facto standard anisette-v3 server
for SideStore/AltStore. Upstream ships only a mutable :latest (no GitHub
releases / semver / sha tags), so pinned by manifest digest instead of a tag
per the "never :latest" rule. Pulled from DockerHub via the registry-VM
pull-through cache like echo/cyberchef. Diun watches :latest (notify-only) so
a new upstream build prompts a digest re-pin.
- Stateless: emptyDir backs the provisioning-library cache dir (regenerable
download; upstream issue #23 means it doesn't preserve client auth across
restarts anyway) — no PVC, no Vault secret.
- Internal-only endpoint http://anisette.viktorbarzin.lan (auth=none,
allow_local_access_only, ssl_redirect off) — SideStore is a native client
that can't do the Authentik cookie dance, same reasoning as android-emulator's
adb. The .lan CNAME is auto-created by technitium-ingress-dns-sync; never
publicly exposed.
Mirrors the echo/networking-toolbox/android-emulator stack pattern. Service
catalog updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
k8s-portal was the last in-cluster image builder. Its .woodpecker/k8s-portal.yml
was deleted; it now builds on GHA (build-k8s-portal.yml) -> PRIVATE ghcr, pulled
via the Kyverno ghcr-credentials allowlist and deployed by Keel. Fix the CI/CD
section: drop k8s-portal from the Woodpecker-pipelines list (stale), move it from
'already on GHA' to the infra-owned private-ghcr images, and add it to the
PRIVATE ghcr allowlist roster. Completes the no-local-builds migration.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ADR-0002 is fully landed (issues #11-#32 closed): every owned image now
builds on GitHub Actions and pushes to ghcr.io/viktorbarzin/<name>, with
Woodpecker reduced to deploy-only. The Forgejo container registry is frozen
and emptied; there are no in-cluster image builds or CI test runs anywhere.
The docs still described the old hybrid topology (DockerHub builds,
Woodpecker-native owned-app builds, the per-pattern migration lists, the
tripit-only pilot framing), which would mislead future sessions and
incident response.
This brings the docs to the completed reality (closes #33):
- docs/architecture/ci-cd.md: full rewrite as the canonical CI/CD reference —
the fleet GHA->ghcr->Woodpecker-deploy pattern, public/private ghcr package
split, infra-owned image workflows (incl. infra-ci on ghcr), the frozen
Forgejo registry, what Woodpecker still runs, and the #31 decommissions.
- .claude/CLAUDE.md: rewrite the "CI/CD Architecture" section to the
fleet-wide final state; FIX the stale claim that claude-memory-mcp builds
to DockerHub (it is GHA->ghcr); note owned images now live on ghcr and the
Forgejo registry is frozen/break-glass near the image-registry bullet.
- .claude/reference/service-catalog.md: f1-stream is GHA->ghcr + Woodpecker
deploy-only (was "Woodpecker-native build->deploy").
- stacks/{tuya-bridge,android-emulator}/variables.tf + stacks/terminal/main.tf:
cosmetic description/comment updates (forgejo -> ghcr; terminal-lobby has no
CI pipeline). Description/comment text only — no stack logic changed.
Historical records (docs/post-mortems/*, docs/plans/*) and ADR-0002 itself
are left untouched as point-in-time records.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor's standing instruction (2026-06-12): lean on external infra as
much as possible for CI — builds, running tests, lint, releases all on
GitHub Actions hosted runners, never on cluster nodes; in-cluster
pipelines only for cluster-touching steps (deploys, terragrunt,
certbot). Also: watch any triggered pipeline chain to completion and
fix failures immediately. Added to AGENTS.md + .claude/CLAUDE.md
CI sections (ADR-0002 companions).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.
Changes (all in the monitoring module):
* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
once, then only on a membership change or resolve); critical 1h -> 6h
(a slow nag, not an hourly drip). send_resolved stays on. The bulk of
the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
continuously for ~24h, re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
08:00 Europe/London: the full current board grouped by severity + what
resolved in the last 24h. This is the standing-state safety net for the
alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
(no pip/apk at runtime -> none of the per-run disk-write footprint that
disabled status-page-pusher). Reuses the existing Alertmanager Slack
webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
PodImagePullBackOff uninhibited because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
for the same leg — two alerts described one condition and were the #1
noise source (~3,400 alert-minutes over 24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
completed CronJob pods that linger in EndpointSlices as NotReady
addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
transient Pushgateway/scrape blip no longer fires-and-resolves.
* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
annotation, so notification volume was unmeasurable — now we can verify
this change worked (alertmanager_notifications_total et al.).
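
The cascade-inhibition idea above can be sketched as follows. This is a simplified model, not the Alertmanager configuration itself: it assumes source and target alerts are tied together by a shared `node` label, which is an assumption of this sketch, and it lists only the target names spelled out above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    node: str  # assumed label linking a pod alert to its node


# Source alerts that suppress downstream pod-churn alerts.
SOURCES = {"NodeConditionBad", "NodeDiskPressure"}
# Subset of targets named in the commit message.
TARGETS = {
    "PodCrashLooping",
    "PodImagePullBackOff",
    "PodsStuckContainerCreating",
    "ScrapeTargetDown",
}


def notifiable(firing: list[Alert]) -> list[Alert]:
    """Drop target alerts whose node also has a firing source alert."""
    bad_nodes = {a.node for a in firing if a.name in SOURCES}
    return [
        a for a in firing
        if not (a.name in TARGETS and a.node in bad_nodes)
    ]


firing = [
    Alert("NodeDiskPressure", "node1"),
    Alert("PodCrashLooping", "node1"),   # suppressed: same node as a source
    Alert("PodCrashLooping", "node2"),   # kept: no source firing on node2
]
remaining = notifiable(firing)
```

In this model the node-level alert still notifies, while pod alerts on the affected node stay quiet until the source resolves.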
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same-change doc sync for infra#12: the tripit-ns-scoped interim secret
paragraph described the pre-ClusterPolicy state.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The control-plane flap (etcd lease-renewal timeouts) recurred. Rather than move
etcd to SSD (code-oflt, deferred again), the chosen direction is to REDUCE etcd
load enough that the leader-election-timeout band-aid (renew 10s->30s) becomes
removable. These are the big, clean cuts:
1. Remove VPA/Goldilocks (stacks/vpa emptied). All 349 VPAs ran updateMode=Off
(no auto-right-sizing) yet cost ~800 etcd objects + continuous recommender
writes + a pod-creation admission webhook, purely to feed a dashboard. krr
(Dockerized, on-demand) replaces it. Reverses the re-add after memory 2431.
2. Disable kyverno reporting (admission/aggregate/background). policyReports were
already off, so the pipeline generated ephemeralreports + an hourly
all-resource etcd re-scan for NO user-facing output. Admission enforcement
(deny-* policies) and Keel mutation are unaffected; violations surface via
Loki->Slack.
3. descheduler */5 -> hourly (fewer list/evict cycles; rebalancing isn't urgent).
Deferred (poor ROI / unsafe as planned): ESO refreshInterval 15m->1h is a
~20-stack sprawl for ~0.1 writes/s; keel background=false is invalid for a
mutate-existing policy and its churn is apply-time not steady-state. Both filed
as follow-up beads.
Post-apply: delete the chart-orphaned VPA CRDs to cascade-clean leftover CRs.
Then measure etcd apply-latency and revert the timeouts. Docs updated
(VPA/Goldilocks -> krr). See memory 5402-5407.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor is setting up an Android app development pipeline (tripit is the
first app) and wants agents to natively test changes on Android before
shipping. This adds the testing environment: an API-36 Google emulator
under KVM as a privileged pod (namespace joins the Kyverno exclude list),
SDK/system-image/AVD on a proxmox-lvm PVC, adb on the shared MetalLB IP
10.0.20.200:5555 (LAN only), noVNC screen view at
android-emulator.viktorbarzin.lan. Image is built manually from the
stack's docker/ dir (rare rebuilds; off-infra-CI rule targets repeated
builds). First infra ADR records the trade-offs (devvm/VM/redroid/budtmo
rejected).
Viktor got locked out of the break-glass path (forgot the port-knock setup) and
deleted the edge-router forwards, then asked to review and redesign it from
scratch.
Root cause of the lockout: the knock added no real security (key-only SSH is
already brute-force-proof) and its only benefit — hiding the port — came at the
cost of a circular dependency. The knock sequence lived only in in-cluster
Vault, which is unreachable in the exact away/cold scenario break-glass exists
for. So the unlock secret was unavailable precisely when needed.
New model (self-contained, nothing to remember): plain key-only SSH on the
Proxmox host's :52222, openly reachable. The edge router forwards WAN tcp/52222
-> 192.168.1.127:52222 (external port MUST equal internal on the TP-Link AX6000,
which rejects remaps; port 22 itself is reserved). The exposed port trusts only a
dedicated break-glass key via `Match LocalPort` (a leak of any other root key
does not grant internet access), rate-limited (iptables hashlimit) + fail2ban.
- Removed knockd (package + config) and the legacy Synology SSH forward
(ext 3333 -> .13:22, a needless WAN exposure the original plan wanted gone).
- Fixed the fail2ban jail for Debian 13 (auth logs under sshd-session, not sshd;
the stock journalmatch never matched, so it silently never banned).
- Versioned the host config in scripts/ (it was applied ad-hoc, never committed)
and recorded the deliberate Wave-1 "no public-IP" exception in security.md +
.claude/CLAUDE.md. Superseded the 2026-05-30 port-knock design docs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The raw string compare never matched qm config's canonical key order, so
the hourly timer re-issued 'qm set' against every running capped VM,
live-rewriting QEMU throttle state via QMP 24x/day. Implicated in today's
devvm freeze (15:21-16:48 UTC): the guest's disk I/O stalled inside QEMU
(blockstats frozen at 0 while QMP stayed responsive) on the legacy lsi
controller path with no iothread.
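
A minimal sketch of an order-insensitive comparison that avoids the raw-string mismatch described above. The helper names and the example option strings are hypothetical; the real script and its exact option set are not shown here.

```python
def parse_options(value: str) -> dict[str, str]:
    """Split a comma-separated option string into a key -> value mapping."""
    options: dict[str, str] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        # A bare token without '=' (e.g. a volume name) maps to an empty value.
        key, _, val = part.partition("=")
        options[key] = val
    return options


def same_options(desired: str, current: str) -> bool:
    """Compare two option strings regardless of key order."""
    return parse_options(desired) == parse_options(current)


# Illustrative strings: same settings, different key order.
desired = "store:disk-0,iops_wr=100,iops_rd=200"
current = "store:disk-0,iops_rd=200,iops_wr=100"
needs_update = not same_options(desired, current)
```

Here `desired != current` as raw strings, yet `needs_update` is false, so a timer built on this comparison would have no reason to re-issue the setting.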
Viktor asked to root-cause the freeze before choosing fixes, then approved
mitigating via VM settings: this commit fixes the hourly trigger and
documents the incident; the controller swap (virtio-scsi-single +
iothread=1 + aio=threads) is staged on VM 102 separately, pending his
cold stop/start.
Adds docs/post-mortems/2026-06-11-devvm-qemu-io-stall.md (evidence chain,
ruled-out causes, capture-before-kill autopsy steps) and syncs compute.md
+ proxmox-inventory.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to go through the agent's stored infra facts and straighten out anything wrong about what-is-where. Cross-checking docs against the live cluster surfaced doc drift alongside the stale memories:
- compute.md: add k8s-node5/6 (joined 2026-05-26) to diagram + node table; totals 48 vCPU / ~176GB -> 64 vCPU / ~240GB; cluster version v1.34.2 -> v1.34.8 (live-verified)
- storage.md: the nfs-proxmox StorageClass no longer exists (removed 2026-04-25, commit 484b4c71) — nfs-truenas is the only NFS SC; fixed three spots that told readers to use nfs-proxmox
- proxmox-inventory.md: k8s VM RAM rows live-verified via kubectl (master 32G, node1 48G, node2-4 32G — the old 16/32/24G figures predated the 2026-04-02 resize), added node5/6 rows, devvm swap 8G -> 14G (grown 2026-06-10), recomputed total (~288GB nominal of 272GB physical, overcommitted)
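
The figures in the bullets above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this message; the node5/6 sizes are not stated here, so only their combined delta is derived.

```python
# Live-verified k8s VM RAM rows quoted above (GB).
ram_gb = {"master": 32, "node1": 48, "node2": 32, "node3": 32, "node4": 32}

# Sum of the five pre-existing nodes; compare with the old ~176GB total.
five_node_total = sum(ram_gb.values())

# The new compute.md total is ~240GB, so node5 + node6 together add the difference.
new_total = 240
added_by_new_nodes = new_total - five_node_total

# Host-level overcommit: ~288GB nominal against 272GB physical.
overcommit_ratio = 288 / 272

print(five_node_total, added_by_new_nodes, round(overcommit_ratio, 2))
```

The five-node sum comes to 176GB, consistent with the old total, and the nominal allocation works out to roughly 6% over physical RAM.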
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sets FARE_PROVIDER=playwright + FARE_CDP_URL on the tripit deployment so the planning workspace's flight_fare cells auto-fetch live Google Flights quotes through the existing in-cluster headed browser (tripit issue #18, ADR-0007 — rate-limited, cached, degrades to manual entry). Viktor asked to complete the trip-planning tickets; this is the infra leg of the fare-scrape slice. Docs: chrome-service architecture + service catalog updated (tripit is now the second active CDP caller; catalog's legacy :3000 WS pool line corrected to CDP :9222). HOLD-ORDER NOTE: pushed only after the tripit image containing FareMode.playwright rolled out (older images crash-loop on the unknown enum).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor asked to add connection logs (Traefik/Cloudflare) to catch the
real-path t3 WS drops: a direct-to-t3-serve browser ran 40 min clean
while real tunnel sessions cycle every 15-35s, so the drop originates
above t3-serve and we need to see which layer cuts the socket.
Traefik (/ws duration) and cloudflared (WS close events) already ship to
Loki; the gap was the devvm side. This adds:
- t3-dispatch logs every /ws open/close with dur_ms + cause:
downstream_closed (client/CF/Traefik hung up = last-mile/network),
upstream_closed (t3-serve closed/reset), or graceful. Graceful closes
previously left no trace (default ReverseProxy only logs on error), so a
watchdog-driven reconnect was invisible. Helpers unit-tested.
- devvm-promtail.{yaml,service}: ships devvm journald (t3-dispatch +
t3-serve@<user>) to cluster Loki as job=devvm-journal, mirroring the
pve/rpi-sofia shippers. devvm was never in Loki (standalone VM).
Joined in Loki, the three layers attribute any future drop to a segment
with no repro needed.
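
The open/close logging described above can be sketched roughly as below. The three cause names and `dur_ms` come from this message; how the real dispatcher decides which side closed is not shown here, so the `first_closer`/`clean` inputs and the `event` field are assumptions of this sketch.

```python
from enum import Enum


class CloseCause(str, Enum):
    DOWNSTREAM_CLOSED = "downstream_closed"  # client/CF/Traefik hung up
    UPSTREAM_CLOSED = "upstream_closed"      # t3-serve closed or reset
    GRACEFUL = "graceful"                    # clean close handshake


def classify_close(first_closer: str, clean: bool) -> CloseCause:
    """Map who ended the socket first, and how, to a cause label."""
    if clean:
        return CloseCause.GRACEFUL
    if first_closer == "downstream":
        return CloseCause.DOWNSTREAM_CLOSED
    return CloseCause.UPSTREAM_CLOSED


def close_record(opened_at: float, closed_at: float,
                 first_closer: str, clean: bool) -> dict:
    """Build one structured log record for a /ws close."""
    return {
        "event": "ws_close",  # hypothetical field name
        "dur_ms": int(round((closed_at - opened_at) * 1000)),
        "cause": classify_close(first_closer, clean).value,
    }


record = close_record(0.0, 1.5, "downstream", clean=False)
```

Logging the graceful case explicitly is the point: a clean close produces a record too, so a watchdog-driven reconnect leaves a trace.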
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.
Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
pattern) so applies stop stripping live Keel state
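
A minimal sketch of the config-checksum annotation idea used for pgbouncer.ini: hashing the rendered config into a pod-template annotation means any ini change alters the template and rolls the pods. The annotation key and the config snippets below are hypothetical.

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """Return a stable SHA-256 hex digest of the rendered config."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def pod_annotations(config_text: str) -> dict[str, str]:
    """Annotations to place on the pod template (key name is an assumption)."""
    return {"checksum/config": config_checksum(config_text)}


# Illustrative snippets: the second adds the idle-transaction timeout.
before = "[pgbouncer]\npool_mode = session\n"
after = before + "idle_transaction_timeout = 300\n"

changed = pod_annotations(before) != pod_annotations(after)
```

Because the digest depends only on the config text, an unchanged ini yields the same annotation and causes no rollout, while `changed` is true for the edited one.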
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache
timeouts, conn_max_age) were silently INERT because existingSecret
skips chart env rendering — pods ran defaults (2 workers, 300s
caches, no persistent DB conns). Moved all tuning into
server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
password_stage so username+password render on ONE screen (the
separate order-20 password binding is deleted via API — authentik
requires that when embedding). Outpost log_level trace->info and
1->2 replicas (it is on the hot path of every forward-auth request;
PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
Cache-Control (assets are version-fingerprinted but served with no
max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
opening a fresh TCP connection to the outpost per subrequest) +
config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
(it is a live CNPG primary-selector compatibility service).
Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.
The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; all legs clean while
a user drops exonerates infra with data. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
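
The attribution logic of the three legs can be sketched as a small decision function. The precedence below (innermost segment first) is one reading of how the legs overlap, not the probe's actual code, and the returned labels are hypothetical.

```python
def attribute_drop(cloudflare_ok: bool, internal_ok: bool, t3serve_ok: bool) -> str:
    """Name the segment implicated by the current leg states."""
    # The t3serve leg talks straight to the serve process, so its failure
    # points at t3-serve regardless of the outer legs.
    if not t3serve_ok:
        return "t3-serve"
    # The internal leg adds Traefik and dispatch on top of t3-serve.
    if not internal_ok:
        return "traefik-or-dispatch"
    # The cloudflare leg adds public DNS, WAN, the CF edge and the tunnel.
    if not cloudflare_ok:
        return "wan-cloudflare-tunnel"
    # All legs clean: a user-side drop is not explained by infra.
    return "infra-exonerated"


verdict = attribute_drop(cloudflare_ok=False, internal_ok=True, t3serve_ok=True)
```

With only the cloudflare leg down, the verdict lands on the WAN/Cloudflare/tunnel segment, since the two inner legs prove everything behind the Traefik LB is healthy.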
Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.