Commit graph

20 commits

Author SHA1 Message Date
Viktor Barzin
e2146e6916 gpu: schedule off NFD label, not k8s-node1 hostname
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:

- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
  (frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
  audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
  auto-applied by gpu-feature-discovery on any node carrying an
  NVIDIA PCI device, so the selector follows the card.

- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
  nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
  each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
  'kubectl label gpu=true' since NFD handles labeling.

- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
  nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
  the GPU node) but portable when the card relocates.
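
  Concretely, the two selector shapes above look roughly like this in the
  kubernetes provider's HCL (fragments only; the surrounding resources and
  exact stack files are not reproduced here):

  ```hcl
  # Inside each GPU workload's pod template spec (frigate, immich, whisper, ...):
  node_selector = {
    "nvidia.com/gpu.present" = "true" # applied by gpu-feature-discovery, follows the card
  }

  # MySQL pod template: repel from whichever node currently carries the GPU.
  # (required_* variant shown; the stack may use the preferred_* form.)
  affinity {
    node_affinity {
      required_during_scheduling_ignored_during_execution {
        node_selector_term {
          match_expressions {
            key      = "nvidia.com/gpu.present"
            operator = "NotIn"
            values   = ["true"]
          }
        }
      }
    }
  }
  ```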

Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.

Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
2026-04-22 13:43:07 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
a0d770d9a7 [cluster-health] Expand to 42 checks, remove pod CronJob path
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
  readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
  monitoring prom+AM/vault-sealed/CSS, external reachability cloudflared
  +authentik/ExternalAccessDivergence/traefik-5xx). Bump TOTAL_CHECKS
  to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
  lines) and the openclaw cluster_healthcheck CronJob (local CLI is
  now the single authoritative runner). Keep the healthcheck SA +
  Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
  the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
  the script first, refreshes the 42-check table, drops stale
  CronJob/Slack/post-mortem sections, documents the monorepo-canonical
  + hardlink layout. File is hardlinked to
  /home/wizard/code/.claude/skills/cluster-health/SKILL.md for
  dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:13:03 +00:00
Viktor Barzin
702db75f84 [redis] Stabilise patch_redis_service trigger + document service naming
## Context

`null_resource.patch_redis_service` uses `triggers = { always = timestamp() }`,
so every `scripts/tg plan` on `stacks/redis` reports `1 to destroy, 1 to add`
even when nothing has changed. That noise drowns out real drift and trains us to
ignore redis-stack plans — exactly what you don't want on a load-bearing patch.

The patch itself is still load-bearing (three consumers hard-code bare
`redis.redis.svc.cluster.local` — `stacks/immich/chart_values.tpl:12`,
`stacks/ytdlp/yt-highlights/app/main.py:136`, `config.tfvars:214` — plus
Bitnami's own sentinel scripts set `REDIS_SERVICE=redis.redis.svc.cluster.local`
and call it during pod startup). Removing the null_resource is a follow-up
(beads T0) once those consumers migrate to `redis-master.redis.svc`. For now
the goal is just: stop being noisy.

## This change

1. Replace the `always = timestamp()` trigger with two inputs that only change
   when re-patching is genuinely required:
   - `chart_version = helm_release.redis.version` — changes only on a Bitnami
     chart version bump, which is the one code path that rewrites the `redis`
     Service selector back to `component=node`.
   - `haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])`
     — changes only when HAProxy config is edited; aligned with the existing
     `checksum/config` annotation that rolls the Deployment on config change.

   Both attributes are known at plan time (verified against `hashicorp/helm`
   v3.1.1 provider binary). Rejected alternatives — `metadata[0].revision`
   (not exposed in the plugin-framework v3 rewrite), `sha256(jsonencode(values))`
   (readability unverified on v3), and `kubernetes_deployment.haproxy.id`
   (static `namespace/name`, never changes) — don't meet the bar.

2. Add a **Redis Service Naming** section to `AGENTS.md` that explicitly
   states the write/sentinel/avoid endpoints, so new consumers start from
   `redis-master.redis.svc` (the documented `var.redis_host`) and long-lived
   connections (PUBSUB, BLPOP, Sidekiq) route around HAProxy's `timeout
   client 30s` via the sentinel headless path. Uptime Kuma's Redis monitor
   already learned that lesson the hard way (memory id=748).
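
For reference, the resulting trigger map from (1) looks roughly like this
(sketch; the provisioner and the rest of the resource are unchanged and
omitted):

```hcl
resource "null_resource" "patch_redis_service" {
  triggers = {
    # Re-patch only on a Bitnami chart bump — the one path that resets the selector...
    chart_version  = helm_release.redis.version
    # ...or when the HAProxy config itself changes.
    haproxy_config = sha256(kubernetes_config_map.haproxy.data["haproxy.cfg"])
  }

  # provisioner "local-exec" { ... }  # unchanged, omitted
}
```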

## What is NOT in this change

- Deleting `null_resource.patch_redis_service` — still load-bearing (T0).
- Deleting `kubernetes_service.redis_master` — stays as the declared write API.
- Migrating any consumer off bare `redis.redis.svc` — T0 epic.
- Per-client sentinel migration — T1 epic.
- Retiring HAProxy — T2 epic (blocked on T1 + T3).

## Before / after

Before (steady state):
```
scripts/tg plan
Plan: 1 to add, 2 to change, 1 to destroy.
#   null_resource.patch_redis_service must be replaced
#     triggers = { "always" = "<timestamp>" } -> (known after apply)
```

After (steady state, post-apply):
```
scripts/tg plan
No changes. Your infrastructure matches the configuration.
```

After (chart version bump):
```
scripts/tg plan
#   null_resource.patch_redis_service must be replaced
#     triggers = { "chart_version" = "25.3.2" -> "25.4.0" }
```
— the trigger fires only when it actually needs to.

## Test Plan

### Automated

`scripts/tg plan` pre-change (confirms baseline noise):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        ~ "always" = "2026-04-19T10:39:40Z" -> (known after apply)
      }
  }
Plan: 1 to add, 2 to change, 1 to destroy.
```

`scripts/tg plan` post-edit (confirms the one-time structural replacement):
```
# module.redis.null_resource.patch_redis_service must be replaced
-/+ resource "null_resource" "patch_redis_service" {
    ~ triggers = { # forces replacement
        - "always"         = "2026-04-19T10:39:40Z" -> null
        + "chart_version"  = "25.3.2"
        + "haproxy_config" = "989bca9483cb9f9942017320765ec0751ac8357ff447acc5ed11f0a14b609775"
      }
  }
```

Apply is deferred to the operator — the working tree on the same file also
contains an unrelated HAProxy DNS-resolvers fix (for today's immich outage)
that needs its own review before rolling out together. No `scripts/tg apply`
run from this session.

### Manual Verification

Reproduce locally:
1. `cd infra/stacks/redis && ../../scripts/tg plan`
2. Before apply: expect `null_resource.patch_redis_service` to be replaced
   exactly once, with the trigger map transitioning from `{always = <ts>}`
   to `{chart_version, haproxy_config}`.
3. After apply: `../../scripts/tg plan` twice in a row must both report
   `No changes.` (excluding unrelated drift from other work-in-progress).
4. Cluster-side invariant (must hold pre- and post-apply):
   `kubectl -n redis get svc redis -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
   `kubectl -n redis get svc redis-master -o jsonpath='{.spec.selector}'`
   → `{"app":"redis-haproxy"}`
5. Regression test for the trigger doing its job: bump `helm_release.redis.version`
   in a branch, `tg plan`, expect the null_resource to replace. Revert.
2026-04-19 12:17:52 +00:00
Viktor Barzin
8a99be1194 [infra] Document HCL import {} block convention [ci skip]
## Context

Wave 8 of the state-drift consolidation plan — adopt the HCL `import {}`
block pattern (Terraform 1.5+) as the canonical way to bring live
cluster / Vault / Cloudflare resources under TF management.

Historically the repo has used `terraform import` on the CLI for
adoptions. That path has three real problems:

1. **Not reviewable** — it's an out-of-band state mutation that leaves
   no trace in git beyond the subsequent `resource {}` block. A
   reviewer sees only the new resource, not the adoption intent.
2. **Not plan-safe** — if the resource address or ID is wrong, the CLI
   path commits the mistake to state before anyone can catch it.
3. **Not idempotent** — a failed apply mid-import leaves state in a
   confusing half-adopted shape.

`import {}` blocks fix all three: the adoption intent is in the PR
diff, `scripts/tg plan` shows the import as its own plan line (mistyped
IDs fail before apply), and re-applying after a partial failure just
retries the import step.
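
For orientation, the stanza itself is tiny. A minimal sketch, assuming a
hypothetical kured adoption (resource address, import ID, and chart details
are illustrative):

```hcl
# Adoption intent lives in the diff; the stanza is removed again after apply.
import {
  to = helm_release.kured   # hypothetical resource address
  id = "kube-system/kured"  # helm_release import ID: <namespace>/<release-name>
}

resource "helm_release" "kured" {
  name       = "kured"
  namespace  = "kube-system"
  repository = "https://kubereboot.github.io/charts" # assumed chart repo
  chart      = "kured"
}
```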

Canonicalizing the pattern before Wave 5 (Calico + kured adoption) lands
so the reviewer of those imports has the rule in front of them.

## This change

- `AGENTS.md`: new "Adopting Existing Resources — Use `import {}` Blocks,
  Not the CLI" section sitting right after Execution. Includes the
  canonical 5-step workflow (write resource → add import stanza → plan
  to zero → apply → drop stanza), the reasoning, and a per-provider ID
  format table (helm_release, kubernetes_manifest, kubernetes_<kind>_v1,
  authentik_provider_proxy, cloudflare_record).
- `.claude/CLAUDE.md`: one-line cross-reference at the end of the
  Terraform State two-tier section pointing back to AGENTS.md. Keeps
  CLAUDE.md's quick-reference density intact while making sure the rule
  is reachable from the Claude-instructions path.

## What is NOT in this change

- Any actual imports — this is a pure docs landing. Wave 5 will
  demonstrate the pattern on kured + Calico.
- Replacing the handful of existing `terraform import`-style adoptions
  in the repo history — `import {}` blocks are delete-after-apply, so
  retro-documenting them is not useful.

Closes: code-[wave8-task]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:10:05 +00:00
Viktor Barzin
c9d221d578 [infra] Establish KYVERNO_LIFECYCLE_V1 drift-suppression convention [ci skip]
## Context

Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.

The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the rest of the expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.

What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.

## This change

- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
  Attached inline on every `spec[0].template[0].spec[0].dns_config` line
  (or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
  existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
  `AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
  the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.

## What is NOT in this change

- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
  — that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
  save for the inline comment.
- The fallback module the original plan anticipated — skipped because
  Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
  (claude-agent-service, freedify/factory, hermes-agent). Reverted to
  keep this commit scoped to the convention rollout.

## Before / after

Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```

After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
  ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```

## Test Plan

### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ -g '*.tf' -g '*.tf.example' \
    | awk -F: '{s+=$2} END {print s}'
27

$ git diff --name-only | grep -E '\.(tf|tf\.example|md)$' | wc -l
21

# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```

### Manual Verification

No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.

## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
   suppression.

Closes: code-28m

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:15:51 +00:00
Viktor Barzin
42f1c3cf4f [claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP
## Context

The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.

## This change

Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
  of SSH key; curl POST /execute + poll /jobs/{id} replaces SSH invocation
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
  construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
  Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
  secret/n8n)

Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated

## What is NOT in this change

- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)

[ci skip]

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 10:12:02 +00:00
Viktor Barzin
b1d152be1f [infra] Auto-create Cloudflare DNS records from ingress_factory
## Context

Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.

## This change:

- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
  modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
  the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
  `*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
  cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
  dns_type. 17 hostnames remain centrally managed (Helm ingresses,
  special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.

```
BEFORE                          AFTER
config.tfvars (manual list)     stacks/<svc>/main.tf
        |                         module "ingress" {
        v                           dns_type = "proxied"
stacks/cloudflared/               }
  for_each = list                         |
  cloudflare_record                       v
  tunnel per-hostname             auto-creates
                                  cloudflare_record + annotation
```
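
Roughly, a service stack opts in like this (sketch; the module source path and
the other ingress arguments vary per stack):

```hcl
module "ingress" {
  source = "../../modules/kubernetes/ingress_factory" # assumed path
  # ... existing ingress arguments ...

  dns_type = "proxied"       # auto-creates the Cloudflare CNAME to the tunnel
  # dns_type = "non-proxied" # would create an A/AAAA record to the public IP instead
}
```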

## What is NOT in this change:

- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:45:04 +00:00
Viktor Barzin
c33f597111 feat(upgrade-agent): add automated service upgrade pipeline with n8n + DIUN
Pipeline: DIUN detects new image versions every 6h → webhook to n8n →
n8n filters (skip databases/custom/infra/:latest) and rate-limits
(max 5/6h) → SSH to dev VM → claude -p runs upgrade agent.

Agent workflow: resolve GitHub repo → fetch changelogs → classify risk
(SAFE/CAUTION) → backup DB if needed → bump version in .tf → commit+push
→ wait for CI → verify (pod ready + HTTP + Uptime Kuma) → rollback on
failure.

Changes:
- stacks/n8n: add N8N_PORT=5678 to fix K8s env var conflict
- stacks/n8n/workflows: version-controlled n8n workflow backup
- docs/architecture/automated-upgrades.md: full pipeline documentation
- AGENTS.md: add upgrade agent section
- service-catalog.md: update DIUN description
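
On the N8N_PORT item: Kubernetes injects docker-link-style service env vars
(for a Service named n8n that includes N8N_PORT=tcp://<cluster-ip>:5678), which
n8n then misreads as its port setting; pinning the value explicitly wins. A
minimal sketch of the fix inside the container block (surrounding resource
assumed):

```hcl
env {
  name  = "N8N_PORT"
  value = "5678" # overrides the tcp://... service-link var Kubernetes injects
}
```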

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:38:27 +00:00
Viktor Barzin
dcc96f465e docs(storage): add encrypted LVM documentation
Update storage docs to reflect the 2026-04-15 migration of all sensitive
services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow
documentation, updated architecture diagram, and storage class decision
rules.
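
The pattern the encrypted PVC template captures, roughly (sketch under assumed
names; the real template lives in the updated docs):

```hcl
resource "kubernetes_persistent_volume_claim_v1" "data" {
  metadata {
    name      = "example-data" # hypothetical
    namespace = "example"      # hypothetical
  }
  spec {
    access_modes       = ["ReadWriteOnce"]
    storage_class_name = "proxmox-lvm-encrypted" # LUKS2-backed class from this migration
    resources {
      requests = {
        storage = "10Gi"
      }
    }
  }
}
```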

Files updated:
- .claude/CLAUDE.md: storage decision table, encrypted PVC template
- docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths
- AGENTS.md: storage section with encrypted SC as default for sensitive data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:00:37 +00:00
Viktor Barzin
1ef40daeec docs: update for MySQL 3→1, CrowdSec/Technitium PG migration, PG tuning, NFS async, node OS tuning [ci skip] 2026-04-13 23:05:46 +01:00
Viktor Barzin
82f674a0b4 rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]
Reflects the schedule change from weekly to daily. All references updated:
- scripts/weekly-backup.{sh,timer,service} → daily-backup.*
- Pushgateway job name: weekly-backup → daily-backup
- Prometheus metric names: weekly_backup_* → daily_backup_*
- All docs, runbooks, AGENTS.md, CLAUDE.md, proxmox-inventory
- offsite-sync dependency: After=daily-backup.service

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:37:04 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission
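
The config.tfvars / StorageClass side of the cutover, sketched (the class is
declared elsewhere in the repo; parameter names follow csi-driver-nfs):

```hcl
# config.tfvars (sketch): NFS is now served directly by the Proxmox host
nfs_server = "192.168.1.127"

# nfs-csi StorageClass (sketch of the relevant parameters, lives in its own stack)
resource "kubernetes_storage_class_v1" "nfs_csi" {
  metadata { name = "nfs-csi" }
  storage_provisioner = "nfs.csi.k8s.io"
  parameters = {
    server = "192.168.1.127" # Proxmox host
    share  = "/srv/nfs"      # updated export path
  }
}
```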

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
6ba4878f3a docs: update storage architecture for NFS migration to Proxmox host [ci skip] 2026-04-11 17:00:10 +01:00
Viktor Barzin
fc233bd27f docs: comprehensive audit and update of all architecture docs and runbooks [ci skip]
Audited 14 documentation files against live cluster state and Terraform code.

Architecture docs:
- databases.md: MySQL 8.4.4, proxmox-lvm storage (not iSCSI), anti-affinity
  excludes k8s-node1 (GPU), 2Gi/3Gi resources, 7-day rotation (not 24h),
  CNPG 2 instances, PostGIS 16, postgresql.dbaas has endpoints
- overview.md: 1x CPU, ~160GB RAM, all nodes 32GB, proxmox-lvm storage,
  correct Vault paths (secret/ not kv/)
- compute.md: 272GB physical host RAM, ~160GB allocated to VMs
- secrets.md: 7-day rotation, 7 MySQL + 5 PG roles, correct ESO config
- networking.md: MetalLB pool 10.0.20.200-220
- ci-cd.md: 9 GHA projects, travel_blog 5.7GB

Runbooks:
- restore-mysql/postgresql: backup files are .sql.gz (not .sql)
- restore-vault: weekly backup (not daily), auto-unseal sidecar note
- restore-vaultwarden: PVC is proxmox (not iscsi)
- restore-full-cluster: updated node roles, removed trading

Reference docs:
- CLAUDE.md: 7-day rotation, removed trading from PG list
- AGENTS.md: 100+ stacks, proxmox-lvm, platform empty shell
- service-catalog.md: 6 new stacks, 14 stack column updates
2026-04-06 13:21:05 +03:00
Viktor Barzin
307b7f6819 update claude knowledge: infra operational learnings from commit history [ci skip]
Add resource management patterns, networking resilience, service-specific
notes, monitoring patterns, and NFS storage rules extracted from ~963 commits.
2026-03-15 10:46:45 +00:00
Viktor Barzin
2fa8ba2038 [ci skip] add sealed secrets convention: fileset + kubernetes_manifest pattern
- Document sealed secrets workflow in AGENTS.md and CLAUDE.md
- Add kubernetes_manifest + fileset(sealed-*.yaml) block to plotting-book as reference
- Workflow: kubeseal encrypt → commit sealed-*.yaml → CI applies via Terraform
- E2E tested: seal/commit/plan/apply/decrypt cycle verified
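
The plotting-book reference block is essentially this shape (sketch; directory
and glob assumed from the convention above):

```hcl
# Apply every committed sealed secret in the stack directory.
resource "kubernetes_manifest" "sealed_secrets" {
  for_each = fileset(path.module, "sealed-*.yaml")
  manifest = yamldecode(file("${path.module}/${each.value}"))
}
```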
2026-03-08 20:03:50 +00:00
Viktor Barzin
9f2ac0fd1a [ci skip] update AGENTS.md + CLAUDE.md with SOPS workflow, add k8s-portal CI pipeline
AGENTS.md: added SOPS secrets management section, scripts/tg usage,
contributor onboarding steps, pull-through cache bypass notes.

CLAUDE.md: added SOPS workflow note, linux/amd64 build reminder,
versioned tag guidance for pull-through cache.

CI: new .woodpecker/k8s-portal.yml pipeline — auto-builds and deploys
the k8s portal when files under stacks/platform/modules/k8s-portal/files/
change on master push. Uses buildx for linux/amd64.
2026-03-07 15:37:19 +00:00
Viktor Barzin
8d3db35b5e [ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer
AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude,
Cursor). Covers: critical rules, architecture, storage, tiers, common ops.

CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences.
References AGENTS.md for shared knowledge.

Removed generic agents (devops-engineer, fullstack-developer).
2026-03-06 23:50:26 +00:00