Compare commits
55 commits
5319f03ebc
...
43b4e1d372
| Author | SHA1 | Date |
|---|---|---|
| | 43b4e1d372 | |
| | 81e7c3d6ee | |
| | bde713f8a4 | |
| | 4f54c959d7 | |
| | e1d20457c4 | |
| | c9d221d578 | |
| | a62b43d19e | |
| | 91165e31b9 | |
| | 82b7866bc9 | |
| | 9a2e920006 | |
| | a24cf8c689 | |
| | 9ea7eec362 | |
| | cacc282f1a | |
| | b41528e564 | |
| | 6e19dce99e | |
| | e4a96591b3 | |
| | 4eb68d6b1a | |
| | c0ac24a54c | |
| | 3e11bd1b67 | |
| | 2fe3bb3307 | |
| | dacf3d9e11 | |
| | 9ea4ccf17e | |
| | 7b88479278 | |
| | 8a42a1708d | |
| | 50dea8f0a7 | |
| | 8a05475218 | |
| | 50e8184d99 | |
| | d3bdf87676 | |
| | dad62647cd | |
| | 1de2ee307f | |
| | 903fc8377f | |
| | 0386f03f1a | |
| | a12b06c608 | |
| | 57fdea4b99 | |
| | 7091ef2dd6 | |
| | 7b248897d3 | |
| | c175cfd69b | |
| | cc44bccfaa | |
| | dbf7732a66 | |
| | 80b6591e8b | |
| | 69fbd0ffd6 | |
| | 99180bec42 | |
| | 42f1c3cf4f | |
| | 947f8ace54 | |
| | 99688bbb02 | |
| | b30bfd4690 | |
| | 9780c04ca0 | |
| | 18338a883f | |
| | 2033e76798 | |
| | b326c572a6 | |
| | f6812fe69f | |
| | 842646ea4f | |
| | 65b0f30d5e | |
| | 4117809a54 | |
| | 498e7f3305 | |
74 changed files with 4481 additions and 2603 deletions
|
|
@ -73,7 +73,7 @@ Violations cause state drift, which causes future applies to break or silently r
|
|||
- **LimitRange**: Tier-based defaults silently apply to pods with `resources: {}`. Always set explicit resources on containers needing more than defaults. Tier 3-edge and 4-aux now use Burstable QoS (request < limit) to reduce scheduler pressure.
|
||||
- **Democratic-CSI sidecars**: Must set explicit resources (32-80Mi) in Helm values — 17 sidecars default to 256Mi each via LimitRange. `csiProxy` is a TOP-LEVEL chart key, not nested under controller/node.
|
||||
- **ResourceQuota blocks rolling updates**: When quota is tight, scale to 0 then back to 1 instead of RollingUpdate. Or use Recreate strategy.
|
||||
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Add `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] }` to kubernetes_deployment resources to prevent perpetual TF plan drift.
|
||||
- **Kyverno ndots drift**: Kyverno injects dns_config on all pods. Every `kubernetes_deployment`, `kubernetes_stateful_set`, and `kubernetes_cron_job_v1` MUST include `lifecycle { ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1 }` (use `spec[0].job_template[0].spec[0].template[0].spec[0].dns_config` for CronJobs). The `# KYVERNO_LIFECYCLE_V1` marker is the canonical discoverability tag — grep for it to locate every site. A shared Terraform module was considered but `ignore_changes` only accepts static attribute paths (not module outputs, locals, or expressions), so the snippet convention is the only viable path. Full rationale and copy-paste snippets in `AGENTS.md` → "Kyverno Drift Suppression".
|
||||
- **NVIDIA GPU operator resources**: dcgm-exporter and cuda-validator resources configurable via `dcgmExporter.resources` and `validator.resources` in nvidia values.yaml.
|
||||
- **Pin database versions**: Disable Diun (image update monitoring) for MySQL, PostgreSQL, Redis.
|
||||
- **Quarterly right-sizing**: Check the Goldilocks dashboard and compare each VPA upperBound to the current request. Also flag under-provisioned containers (VPA upperBound > request × 0.8).
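
The under-provisioning check in the right-sizing bullet can be sketched as a single comparison. This is a minimal sketch, assuming the VPA upperBound and the current request have already been read off Goldilocks in the same unit (bytes or millicores); the function name `is_under_provisioned` and its `threshold` parameter are illustrative and do not exist in the repo.

```python
def is_under_provisioned(vpa_upper_bound: float, current_request: float,
                         threshold: float = 0.8) -> bool:
    """Return True when the VPA upperBound exceeds threshold x the current request."""
    if current_request <= 0:
        # Assumption: with no request set, any positive recommendation counts.
        return vpa_upper_bound > 0
    return vpa_upper_bound > current_request * threshold


# Example: a 100Mi request with a 90Mi upperBound leaves too little headroom.
print(is_under_provisioned(90, 100))
print(is_under_provisioned(70, 100))
```

A container flagged this way is a candidate for a higher request at the next quarterly review; the check does not replace reading the dashboard itself.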
|
||||
|
|
@ -133,7 +133,7 @@ Repo IDs: infra=1, Website=2, finance=3, health=4, travel_blog=5, webhook-handle
|
|||
## Monitoring & Alerting
|
||||
- Alert cascade inhibitions: if node is down, suppress pod alerts on that node.
|
||||
- Exclude completed CronJob pods from "pod not ready" alerts.
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns).
|
||||
- Every new service gets Prometheus scrape config + Uptime Kuma monitor. External monitors auto-created for Cloudflare-proxied services by `external-monitor-sync` CronJob (10min, uptime-kuma ns). Mechanism: `ingress_factory` auto-adds `uptime.viktorbarzin.me/external-monitor=true` whenever `dns_type != "none"` (see `modules/kubernetes/ingress_factory/main.tf`) — no manual action needed on new services. The `cloudflare_proxied_names` list in `config.tfvars` is a legacy fallback for the 17 hostnames not yet migrated to `ingress_factory` `dns_type`; don't check that list when debugging "is this monitored?" questions.
|
||||
- **External monitoring**: `[External] <service>` monitors in Uptime Kuma test full external path (DNS → Cloudflare → Tunnel → Traefik). Divergence metric `external_internal_divergence_count` → alert `ExternalAccessDivergence` (15min). Config: `stacks/uptime-kuma/`, targets from `cloudflare_proxied_names` in `config.tfvars` (17 remaining centrally-managed hostnames; most DNS records now auto-created by `ingress_factory` `dns_type` param).
|
||||
- Key alerts: OOMKill, pod replica mismatch, 4xx/5xx error rates, UPS battery, CPU temp, SSD writes, NFS responsiveness, ClusterMemoryRequestsHigh (>85%), ContainerNearOOM (>85% limit), PodUnschedulable, ExternalAccessDivergence.
|
||||
- **E2E email monitoring**: CronJob `email-roundtrip-monitor` (every 20 min) sends test email via Mailgun API to `smoke-test@viktorbarzin.me` (catch-all → `spam@`), verifies IMAP delivery, deletes test email, pushes metrics to Pushgateway + Uptime Kuma. Alerts: `EmailRoundtripFailing` (60m), `EmailRoundtripStale` (60m), `EmailRoundtripNeverRun` (60m). Outbound relay: Brevo EU (`smtp-relay.brevo.com:587`, 300/day free — migrated from Mailgun). Mailserver on dedicated MetalLB IP `10.0.20.202` with `externalTrafficPolicy: Local` for CrowdSec real-IP detection. Vault: `mailgun_api_key` in `secret/viktor` (probe), `brevo_api_key` in `secret/viktor` (relay).
|
||||
|
|
|
|||
169
.claude/agents/payslip-extractor.md
Normal file
|
|
@ -0,0 +1,169 @@
|
|||
---
name: payslip-extractor
description: "Extract structured UK payslip fields from a base64-encoded PDF into strict JSON."
model: haiku
allowedTools:
  - Bash
  - Read
---
|
||||
|
||||
You are a headless payslip-field extractor. You receive a prompt containing a base64-encoded UK payslip PDF plus a target JSON schema, and you produce exactly one JSON object that matches the schema.
|
||||
|
||||
## Your single job
|
||||
|
||||
Given a prompt that contains:
|
||||
- A line of the form `PDF_BASE64: <base64-blob>`
|
||||
- A JSON schema describing the target fields
|
||||
|
||||
Produce EXACTLY ONE JSON object on stdout matching the schema. No prose. No markdown fences. No preamble. No trailing commentary. The final message content must be a single valid JSON object and nothing else.
|
||||
|
||||
## Processing steps
|
||||
|
||||
### Step 1. Extract and decode the base64 PDF
|
||||
|
||||
The prompt will include a line that starts with `PDF_BASE64:` followed by the base64 blob. Decode it to `/tmp/payslip.pdf`.
|
||||
|
||||
Preferred method (handles whitespace and very long blobs robustly):
|
||||
|
||||
```bash
|
||||
python3 - <<'PY'
import base64, os, pathlib, re, sys

# Assumes the orchestrator exported the prompt as PAYSLIP_PROMPT. If it did not,
# read the PDF_BASE64 value out of the prompt you were given and use the stdin variant below.
prompt = os.environ.get("PAYSLIP_PROMPT", "")
match = re.search(r"^PDF_BASE64:\s*(\S+)", prompt, re.M)
if not match:
    sys.exit("PDF_BASE64 line not found in PAYSLIP_PROMPT")
# Strip any stray whitespace, then base64-decode.
data = base64.b64decode(match.group(1).strip())
pathlib.Path("/tmp/payslip.pdf").write_bytes(data)
print("decoded bytes:", len(data))
PY
|
||||
```
|
||||
|
||||
In practice: read the `PDF_BASE64:` value out of the prompt you have been given (you can see the full prompt), then run:
|
||||
|
||||
```bash
|
||||
python3 -c "
|
||||
import base64, sys
|
||||
data = sys.stdin.read().strip()
raw = base64.b64decode(data)  # decode once, reuse for the write and the size check
with open('/tmp/payslip.pdf', 'wb') as f:
    f.write(raw)
print('decoded bytes:', len(raw))
|
||||
" <<'B64'
|
||||
<paste-the-base64-here>
|
||||
B64
|
||||
```
|
||||
|
||||
Or pipe via shell `base64 -d`:
|
||||
|
||||
```bash
|
||||
printf '%s' '<base64>' | base64 -d > /tmp/payslip.pdf
|
||||
```
|
||||
|
||||
Verify the file looks like a PDF:
|
||||
|
||||
```bash
|
||||
head -c 8 /tmp/payslip.pdf | xxd
|
||||
# Expected: 25 50 44 46 2d (i.e. "%PDF-")
|
||||
```
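
The decode and magic-byte check from Step 1 can also be done in one place. The following is a minimal sketch, assuming the base64 text has already been copied out of the prompt; the helper name `decode_pdf` is hypothetical, and the error text simply reuses one of the acceptable reasons from "Failure mode".

```python
import base64
import binascii

PDF_MAGIC = b"%PDF-"  # same bytes as 25 50 44 46 2d in the xxd output


def decode_pdf(b64_text: str) -> bytes:
    """Decode a base64 blob and confirm it starts with the PDF magic bytes."""
    compact = "".join(b64_text.split())  # drop newlines and spaces inside the blob
    try:
        raw = base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("base64 did not decode to a valid PDF") from exc
    if not raw.startswith(PDF_MAGIC):
        raise ValueError("base64 did not decode to a valid PDF")
    return raw
```

Raising a single `ValueError` for both cases keeps the mapping to the `{"error": ...}` object simple: the caller only needs the message.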
|
||||
|
||||
### Step 2. Extract text from the PDF
|
||||
|
||||
Try tools in this order. Use the first one that works; do not chain all of them.
|
||||
|
||||
1. `pdftotext` from `poppler-utils` (preferred — fastest, most reliable on layout-preserving payslips):
|
||||
```bash
|
||||
pdftotext -layout /tmp/payslip.pdf - 2>/dev/null
|
||||
```
|
||||
|
||||
2. Python `pypdf` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
from pypdf import PdfReader
|
||||
r = PdfReader('/tmp/payslip.pdf')
|
||||
for p in r.pages:
    print(p.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
3. Python `pdfplumber` fallback:
|
||||
```bash
|
||||
python3 -c "
|
||||
import pdfplumber
|
||||
with pdfplumber.open('/tmp/payslip.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text() or '')
|
||||
"
|
||||
```
|
||||
|
||||
4. If none of those are installed, check what IS available:
|
||||
```bash
|
||||
which pdftotext pdf2txt.py mutool
|
||||
python3 -c "import pypdf, pdfplumber, pdfminer" 2>&1
|
||||
```
|
||||
and use whatever you find (e.g. `mutool draw -F txt`).
|
||||
|
||||
If every text-extraction tool fails, emit the failure JSON (see "Failure mode" below).
|
||||
|
||||
### Step 3. Parse the extracted text
|
||||
|
||||
UK payslips are laid out in a few common templates (Sage, Iris, QuickBooks, Xero, in-house ADP/Workday layouts). Common landmarks:
|
||||
|
||||
- "Pay Date" / "Payment Date" / "Date Paid" — the date wages hit the account. Usually at the top or in a header box.
|
||||
- "Tax Period" / "Period" / "Month" — e.g. "Month 1", "Week 12".
|
||||
- Two numeric columns per line: "This Period" (or "Amount", "Current") and "Year to Date" (or "YTD"). **Always take the This Period column**, never YTD.
|
||||
- Payments / Earnings block: "Basic Pay", "Salary", "Bonus", "Overtime", "Commission", "Holiday Pay".
|
||||
- Deductions block: "Income Tax" / "PAYE", "National Insurance" / "NI" / "NIC", "Pension" / "Pension Contribution" / "Salary Sacrifice Pension", "Student Loan" / "SL", optional: "Union Dues", "Charity", "Season Ticket Loan", "Private Medical", etc.
|
||||
- "Gross Pay" / "Total Gross" — sum of payments.
|
||||
- "Net Pay" / "Take Home" / "Amount Payable" — the money actually paid.
|
||||
- "Tax Code" — e.g. "1257L", "BR", "D0", "NT".
|
||||
- "NI Number" / "National Insurance Number" — `AA123456A` format. Never invent one.
|
||||
- "Employer" / "Company" — usually in the letterhead. "Employee" / "Name".
|
||||
- Currency: almost always GBP / "£" for UK payslips. If the PDF is not in GBP or not a UK payslip, still return the numbers as-is but include a best-effort `currency` field.
|
||||
|
||||
### Step 4. Map to the schema and emit JSON
|
||||
|
||||
Rules that apply regardless of the caller's exact schema:
|
||||
|
||||
- **Dates**: `pay_date` MUST be `YYYY-MM-DD`. If the PDF prints `12/03/2026`, interpret as `DD/MM/YYYY` (UK format) → `2026-03-12`. If ambiguous (`01/02/2026`), prefer UK ordering. If impossible to determine a year, use the pay_period year.
|
||||
- **Money fields**: emit as JSON numbers, not strings. Two decimal places are acceptable (`2450.17`). Strip `£`, commas, and trailing spaces. Negative values stay negative.
|
||||
- **Missing numeric fields**: emit `0` (zero), not `null`, not an empty string, not `"N/A"`.
|
||||
- **`other_deductions`**: an object mapping `{ "<label>": <number>, ... }` for any deduction that isn't one of the first-class fields in the schema (tax, NI, pension, student loan). Use the exact label from the payslip (e.g. `"Season Ticket Loan"`, `"Private Medical"`). If there are no other deductions, emit `{}` — NEVER `null` and NEVER omit the key.
|
||||
- **Column discipline**: ALWAYS use the "This Period" column, NEVER the YTD column. If only one column exists, that's the period column.
|
||||
- **Currency default**: `"GBP"` unless the payslip explicitly shows another currency symbol or ISO code.
|
||||
- **No invented data**: If a field genuinely isn't on the payslip, use the documented default (`0` for money, `""` for strings, `{}` for objects). Do NOT make up names, NI numbers, tax codes, or employers.
|
||||
|
||||
Follow the exact field names and types given in the prompt's schema. If the prompt's schema adds fields not listed above, produce them too using the same discipline.
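
The date and money rules above are mechanical enough to sketch in code. This is an illustration only, assuming the printed values have already been located in the extracted text; the helper names `normalise_pay_date` and `parse_money` are hypothetical, and the agent is free to apply the same rules by hand.

```python
import re
from datetime import datetime


def normalise_pay_date(printed: str) -> str:
    """Interpret a printed date as UK DD/MM/YYYY and return YYYY-MM-DD."""
    return datetime.strptime(printed.strip(), "%d/%m/%Y").strftime("%Y-%m-%d")


def parse_money(printed: str) -> float:
    """Strip the pound sign, commas and whitespace; a missing value maps to 0."""
    cleaned = re.sub(r"[£,\s]", "", printed)
    return float(cleaned) if cleaned else 0.0


print(normalise_pay_date("12/03/2026"))  # 2026-03-12
print(parse_money("£2,450.17"))          # 2450.17
```

Negative amounts pass through unchanged because the minus sign is not stripped, which matches the "negative values stay negative" rule.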
|
||||
|
||||
## Failure mode
|
||||
|
||||
If the PDF cannot be read at all — unreadable base64, not a PDF, encrypted PDF with no text layer, no text-extraction tool available, or clearly not a UK payslip — emit a single JSON object:
|
||||
|
||||
```json
|
||||
{"error": "<short human reason>"}
|
||||
```
|
||||
|
||||
Examples of acceptable error reasons:
|
||||
- `"base64 did not decode to a valid PDF"`
|
||||
- `"pdf has no extractable text layer (image-only scan)"`
|
||||
- `"no pdf text extraction tool available (pdftotext/pypdf/pdfplumber all missing)"`
|
||||
- `"document does not appear to be a UK payslip"`
|
||||
- `"pay_date not found on document"`
|
||||
|
||||
The caller treats the `error` key as a non-retriable parse failure. Do not include any other keys when emitting an error object.
|
||||
|
||||
## Hard constraints — things you MUST NOT do
|
||||
|
||||
1. **No network calls.** Do not curl, wget, dig, or otherwise talk to the network. Everything you need is in the prompt.
|
||||
2. **No modifications to `/workspace/infra/**`.** Do not edit, write, or commit any file under the infra repo. The only file you may create is the scratch PDF at `/tmp/payslip.pdf` (and intermediate text dumps under `/tmp/`).
|
||||
3. **No git operations.** No `git add`, `git commit`, `git push`, nothing.
|
||||
4. **No kubectl, no terraform, no vault.** You are not an infra agent — you are a narrow extractor.
|
||||
5. **No markdown in output.** No ` ```json ` fences, no preamble like "Here's the extraction:", no trailing notes. The ENTIRE final assistant message is exactly one JSON object.
|
||||
6. **No verbose logging in the final message.** It is fine to run bash commands and see their output during processing, but your final assistant message is JSON and nothing else.
|
||||
7. **No hallucinated fields.** If the payslip does not show a pension line, do not invent one. Use the documented default instead.
|
||||
|
||||
## Output discipline — summary
|
||||
|
||||
- Exactly one JSON object, UTF-8, no BOM.
|
||||
- Keys match the schema the caller gave you.
|
||||
- Numeric fields are JSON numbers, not strings.
|
||||
- `pay_date` is `YYYY-MM-DD`.
|
||||
- `other_deductions` is always present and is an object (possibly `{}`).
|
||||
- Missing money → `0`, missing string → `""`, missing object → `{}`.
|
||||
- On unrecoverable failure, one JSON object with a single `error` key.
|
||||
|
||||
That's the whole job. Decode, extract, parse, emit JSON. Be boring and exact.
|
||||
|
|
@ -26,11 +26,12 @@ module "nfs_data" {
|
|||
## ~~iSCSI Storage~~ (REMOVED — replaced by proxmox-lvm)
|
||||
> iSCSI via democratic-csi and TrueNAS has been fully removed (2026-04). All database storage now uses `StorageClass: proxmox-lvm` (Proxmox CSI, LVM-thin hotplug). TrueNAS has been decommissioned.
|
||||
|
||||
## Anti-AI Scraping (5-Layer Defense)
|
||||
## Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||||
Default `anti_ai_scraping = true` in ingress_factory. Disable per-service: `anti_ai_scraping = false`.
|
||||
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Trap links before `</body>`
|
||||
4. Tarpit (~100 bytes/sec) 5. Poison content (CronJob every 6h, `--http1.1` required)
|
||||
Key files: `stacks/poison-fountain/`, `stacks/platform/modules/traefik/middleware.tf`
|
||||
1. Bot blocking (ForwardAuth → poison-fountain) 2. X-Robots-Tag noai 3. Tarpit/poison content (standalone at poison.viktorbarzin.me)
|
||||
Trap links (formerly layer 3) removed April 2026 — rewrite-body plugin broken on Traefik v3.6.12 (Yaegi bugs). `strip-accept-encoding` and `anti-ai-trap-links` middlewares deleted.
|
||||
Rybbit analytics injection now via Cloudflare Worker (`stacks/rybbit/worker/`, HTMLRewriter, wildcard route `*.viktorbarzin.me/*`, 28 site ID mappings).
|
||||
Key files: `stacks/poison-fountain/`, `stacks/rybbit/worker/`, `stacks/platform/modules/traefik/middleware.tf`
|
||||
|
||||
## Terragrunt Architecture
|
||||
- Root `terragrunt.hcl`: DRY providers, backend, variable loading, `generate "tiers"` block
|
||||
|
|
|
|||
|
|
@ -1,102 +0,0 @@
|
|||
# Setup Shared Remote Executor
|
||||
|
||||
Skill for setting up Claude Code's shared remote executor in new projects.
|
||||
|
||||
## When to Use
|
||||
- When adding Claude Code support to a new project
|
||||
- When the user says "set up remote executor for this project"
|
||||
- When working on a new project that needs remote command execution
|
||||
|
||||
## Prerequisites
|
||||
- Shared executor already deployed at `~/.claude/` on wizard@10.0.10.10
|
||||
- Project accessible via NFS from both macOS and the remote VM
|
||||
|
||||
## Setup Steps
|
||||
|
||||
### 1. Create .claude Directory
|
||||
```bash
|
||||
mkdir -p .claude/sessions
|
||||
```
|
||||
|
||||
### 2. Create session-exec.sh Wrapper
|
||||
Create `.claude/session-exec.sh` with the following content (adjust PROJECT_ROOT):
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Project-Local Session Helper - Wrapper for shared executor
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
SHARED_SESSION_EXEC="/home/wizard/.claude/session-exec.sh"
|
||||
PROJECT_ROOT="/home/wizard/path/to/project" # UPDATE THIS
|
||||
|
||||
if [ -f "$SHARED_SESSION_EXEC" ]; then
|
||||
if [ "${1:-}" = "create" ] || [ -z "${1:-}" ]; then
|
||||
"$SHARED_SESSION_EXEC" create "$PROJECT_ROOT"
|
||||
else
|
||||
"$SHARED_SESSION_EXEC" "$@"
|
||||
fi
|
||||
else
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
SESSIONS_DIR="$SCRIPT_DIR/sessions"
|
||||
SESSION_ID="${1:-$(date +%s)-$$-$RANDOM}"
|
||||
ACTION="${2:-create}"
|
||||
SESSION_DIR="$SESSIONS_DIR/$SESSION_ID"
|
||||
|
||||
case "$ACTION" in
|
||||
create|init|"")
|
||||
mkdir -p "$SESSION_DIR"
|
||||
echo "ready" > "$SESSION_DIR/cmd_status.txt"
|
||||
echo "$PROJECT_ROOT" > "$SESSION_DIR/workdir.txt"
|
||||
> "$SESSION_DIR/cmd_input.txt"
|
||||
> "$SESSION_DIR/cmd_output.txt"
|
||||
echo "$SESSION_ID"
|
||||
;;
|
||||
cleanup|remove|delete)
|
||||
[ -d "$SESSION_DIR" ] && rm -rf "$SESSION_DIR"
|
||||
;;
|
||||
status)
|
||||
[ -d "$SESSION_DIR" ] && cat "$SESSION_DIR/cmd_status.txt"
|
||||
;;
|
||||
list)
|
||||
[ -d "$SESSIONS_DIR" ] && ls -1 "$SESSIONS_DIR" 2>/dev/null
|
||||
;;
|
||||
esac
|
||||
fi
|
||||
```
|
||||
|
||||
Make executable: `chmod +x .claude/session-exec.sh`
|
||||
|
||||
### 3. Link Sessions Directory (on remote VM)
|
||||
Run on the remote VM to add project sessions to the shared executor:
|
||||
|
||||
```bash
|
||||
# Option A: Symlink project sessions (if using project-local sessions)
|
||||
ln -sfn /path/to/project/.claude/sessions ~/.claude/sessions
|
||||
|
||||
# Option B: Use shared sessions (all projects share one directory)
|
||||
# Just ensure ~/.claude/sessions exists
|
||||
```
|
||||
|
||||
### 4. Create CLAUDE.md
|
||||
Add execution instructions to `.claude/CLAUDE.md`:
|
||||
|
||||
```markdown
|
||||
## Remote Command Execution
|
||||
Uses shared executor at `~/.claude/` on wizard@10.0.10.10.
|
||||
|
||||
### Usage
|
||||
\```bash
|
||||
SESSION_ID=$(.claude/session-exec.sh)
|
||||
echo "command" > .claude/sessions/$SESSION_ID/cmd_input.txt
|
||||
sleep 1 && cat .claude/sessions/$SESSION_ID/cmd_status.txt
|
||||
cat .claude/sessions/$SESSION_ID/cmd_output.txt
|
||||
\```
|
||||
|
||||
Start executor: `~/.claude/remote-executor.sh` (on remote VM)
|
||||
```
|
||||
|
||||
## Shared Executor Location
|
||||
- Scripts: `~/.claude/remote-executor.sh`, `~/.claude/session-exec.sh`
|
||||
- Sessions: `~/.claude/sessions/`
|
||||
- Remote VM: wizard@10.0.10.10
|
||||
|
|
@ -155,3 +155,19 @@ Common port is 80. Exceptions:
|
|||
3. Add `time.sleep(0.3)` between bulk operations to avoid overloading
|
||||
4. Homepage dashboard widget slug: `cluster-internal`
|
||||
5. Cloudflare-proxied at `uptime.viktorbarzin.me`
|
||||
|
||||
## Terraform-Managed Monitors
|
||||
|
||||
There is NO `louislam/uptime-kuma` Terraform provider. Two patterns exist for
|
||||
declarative monitor management in this stack:
|
||||
|
||||
- **External HTTPS monitors** — auto-discovered from ingress annotations by the
|
||||
`external-monitor-sync` CronJob (`*/10 * * * *`). Opt-out via
|
||||
`uptime.viktorbarzin.me/external-monitor: "false"` on the ingress.
|
||||
- **Internal monitors (DBs, non-HTTP)** — declared in the
|
||||
`local.internal_monitors` list in `stacks/uptime-kuma/modules/uptime-kuma/main.tf`
|
||||
and synced by the `internal-monitor-sync` CronJob. To add one, append to the
|
||||
list (provide `name`, `type`, `database_connection_string`,
|
||||
`database_password_vault_key`, `interval`, `retry_interval`, `max_retries`)
|
||||
and `scripts/tg apply`. The sync is idempotent — looks up by name, creates
|
||||
if missing, patches if drifted. Existing monitors keep their id and history.
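
The idempotent behaviour described for `internal-monitor-sync` (look up by name, create if missing, patch if drifted) can be sketched as a pure planning function. This is a minimal sketch under stated assumptions: the real CronJob talks to the Uptime Kuma API, whereas here `existing` is a plain dict keyed by monitor name, and `plan_sync` is a hypothetical name.

```python
def plan_sync(desired: list[dict], existing: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (action, name) pairs: create if missing, patch if drifted, else nothing."""
    actions = []
    for monitor in desired:
        name = monitor["name"]
        current = existing.get(name)
        if current is None:
            actions.append(("create", name))
        elif any(current.get(key) != value for key, value in monitor.items()):
            # Patching in place is what lets a monitor keep its id and history.
            actions.append(("patch", name))
    return actions
```

Running the plan twice against an unchanged state yields an empty list the second time, which is the property that makes the sync safe to schedule repeatedly.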
|
||||
|
|
|
|||
5
.gitignore
vendored
|
|
@ -65,6 +65,11 @@ state/infra/
|
|||
backend.tf
|
||||
providers.tf
|
||||
.terraform.lock.hcl
|
||||
cloudflare_provider.tf
|
||||
tiers.tf
|
||||
stacks/*/cloudflare_provider.tf
|
||||
stacks/*/tiers.tf
|
||||
stacks/*/terragrunt_rendered.json
|
||||
|
||||
# Kubernetes config (sensitive)
|
||||
config
|
||||
|
|
|
|||
|
|
@ -9,52 +9,70 @@ clone:
|
|||
|
||||
steps:
|
||||
- name: run-issue-responder
|
||||
image: python:3.12-alpine
|
||||
image: alpine:3.20
|
||||
commands:
|
||||
- apk add --no-cache openssh-client curl jq
|
||||
- apk add --no-cache curl jq
|
||||
# Authenticate to Vault via K8s SA JWT
|
||||
- |
|
||||
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
|
||||
VAULT_RESP=$(curl -sf -X POST http://vault-active.vault.svc.cluster.local:8200/v1/auth/kubernetes/login \
|
||||
-d "{\"role\":\"ci\",\"jwt\":\"$SA_TOKEN\"}")
|
||||
VAULT_TOKEN=$(echo "$VAULT_RESP" | jq -r .auth.client_token)
|
||||
if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
|
||||
-d "{\"role\":\"ci\",\"jwt\":\"$$SA_TOKEN\"}")
|
||||
VAULT_TOKEN=$(echo "$$VAULT_RESP" | jq -r .auth.client_token)
|
||||
if [ -z "$$VAULT_TOKEN" ] || [ "$$VAULT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Vault authentication failed"
|
||||
exit 1
|
||||
fi
|
||||
echo "Vault authenticated"
|
||||
# Fetch DevVM SSH key
|
||||
# Fetch API token for claude-agent-service
|
||||
- |
|
||||
curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
|
||||
jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
|
||||
chmod 600 /tmp/devvm-key
|
||||
if [ ! -s /tmp/devvm-key ]; then
|
||||
echo "ERROR: Failed to fetch DevVM SSH key"
|
||||
AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $$VAULT_TOKEN" \
|
||||
http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
|
||||
jq -r '.data.data.api_bearer_token')
|
||||
if [ -z "$$AGENT_TOKEN" ] || [ "$$AGENT_TOKEN" = "null" ]; then
|
||||
echo "ERROR: Failed to fetch agent API token"
|
||||
exit 1
|
||||
fi
|
||||
echo "SSH key fetched"
|
||||
# SSH to DevVM and run issue-responder agent
|
||||
echo "Agent token fetched"
|
||||
# Submit job to claude-agent-service
|
||||
- |
|
||||
ISSUE_NUM="${ISSUE_NUMBER:-}"
|
||||
ISSUE_TITLE="${ISSUE_TITLE:-}"
|
||||
ISSUE_LABELS="${ISSUE_LABELS:-}"
|
||||
ISSUE_URL="${ISSUE_URL:-}"
|
||||
|
||||
if [ -z "$ISSUE_NUM" ]; then
|
||||
if [ -z "$$ISSUE_NUM" ]; then
|
||||
echo "ERROR: No issue number provided"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Processing issue #$ISSUE_NUM: $ISSUE_TITLE"
|
||||
echo "Labels: $ISSUE_LABELS"
|
||||
echo "Processing issue #$$ISSUE_NUM: $$ISSUE_TITLE"
|
||||
|
||||
ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
|
||||
"cd ~/code && git -C infra stash && git -C infra pull --rebase && git -C infra stash pop 2>/dev/null; \
|
||||
~/.local/bin/claude -p \
|
||||
--agent infra/.claude/agents/issue-responder \
|
||||
--dangerously-skip-permissions \
|
||||
--max-budget-usd 10 \
|
||||
'Process GitHub Issue #${ISSUE_NUM}: ${ISSUE_TITLE}. Labels: ${ISSUE_LABELS}. URL: ${ISSUE_URL}. Read the issue body via GitHub API, investigate, and take appropriate action.'"
|
||||
# Cleanup
|
||||
- rm -f /tmp/devvm-key
|
||||
PAYLOAD=$(jq -n \
|
||||
--arg prompt "Process GitHub Issue #$$ISSUE_NUM: $$ISSUE_TITLE. Labels: $$ISSUE_LABELS. URL: $$ISSUE_URL. Read the issue body via GitHub API, investigate, and take appropriate action." \
|
||||
--arg agent ".claude/agents/issue-responder" \
|
||||
'{prompt: $prompt, agent: $agent, max_budget_usd: 10, timeout_seconds: 1800}')
|
||||
|
||||
RESP=$(curl -sf -X POST \
|
||||
-H "Authorization: Bearer $$AGENT_TOKEN" \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$$PAYLOAD" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
|
||||
|
||||
JOB_ID=$(echo "$$RESP" | jq -r '.job_id')
|
||||
echo "Job submitted: $$JOB_ID"
|
||||
# Poll for completion (30min max)
|
||||
- |
|
||||
for i in $(seq 1 120); do
|
||||
sleep 15
|
||||
RESULT=$(curl -sf \
|
||||
-H "Authorization: Bearer $$AGENT_TOKEN" \
|
||||
http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$$JOB_ID)
|
||||
STATUS=$(echo "$$RESULT" | jq -r '.status')
|
||||
echo "[$$i/120] Status: $$STATUS"
|
||||
if [ "$$STATUS" != "running" ]; then
|
||||
echo "$$RESULT" | jq .
|
||||
if [ "$$STATUS" = "completed" ]; then exit 0; else exit 1; fi
|
||||
fi
|
||||
done
|
||||
echo "ERROR: Job timed out after 30 minutes"
|
||||
exit 1
|
||||
|
|
|
|||
|
|
@ -17,7 +17,7 @@ steps:
|
|||
- name: parse-and-implement
|
||||
image: python:3.12-alpine
|
||||
commands:
|
||||
- apk add --no-cache jq curl git openssh-client
|
||||
- apk add --no-cache jq curl git
|
||||
- sh scripts/postmortem-pipeline.sh
|
||||
|
||||
- name: notify-slack
|
||||
|
|
|
|||
24
AGENTS.md
|
|
@ -75,6 +75,28 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
|||
## Shared Variables (never hardcode)
|
||||
`var.nfs_server` (192.168.1.127), `var.redis_host`, `var.postgresql_host`, `var.mysql_host`, `var.ollama_host`, `var.mail_host`
|
||||
|
||||
## Kyverno Drift Suppression (`# KYVERNO_LIFECYCLE_V1`)
|
||||
|
||||
Kyverno's admission webhook mutates every pod with a `dns_config { option { name = "ndots"; value = "2" } }` block (this fixes NXDOMAIN search-domain floods; see the `k8s-ndots-search-domain-nxdomain-flood` skill). Terraform does not manage that field, so without suppression every pod-owning resource shows perpetual `spec[0].template[0].spec[0].dns_config` drift.
|
||||
|
||||
**Rule**: every `kubernetes_deployment`, `kubernetes_stateful_set`, `kubernetes_daemon_set`, and `kubernetes_cron_job_v1` MUST include the following `lifecycle` block, tagged with the `# KYVERNO_LIFECYCLE_V1` marker so every site is greppable:
|
||||
|
||||
```hcl
|
||||
# kubernetes_deployment / kubernetes_stateful_set / kubernetes_daemon_set
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
|
||||
# kubernetes_cron_job_v1 (extra job_template nesting)
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
```
|
||||
|
||||
**Why not a shared module?** Terraform's `ignore_changes` meta-argument only accepts static attribute paths. It rejects module outputs, locals, variables, and any expression. A DRY module is therefore impossible — the canonical pattern IS the snippet + marker. When `kubernetes_manifest` resources get Kyverno `generate.kyverno.io/*` annotations mutated, a sibling convention `# KYVERNO_MANIFEST_V1` will be introduced (Phase B).
|
||||
|
||||
**Audit**: `rg "KYVERNO_LIFECYCLE_V1" stacks/ | wc -l` — should grow (never shrink). Add the marker to every new pod-owning resource. The `_template/main.tf.example` stub shows the canonical form.
|
||||
|
||||
## Tier System
|
||||
`0-core` | `1-cluster` | `2-gpu` | `3-edge` | `4-aux` — Kyverno auto-generates LimitRange + ResourceQuota per namespace based on tier label.
|
||||
- Containers without explicit `resources {}` get default limits (256Mi for edge/aux — causes OOMKill for heavy apps)
|
||||
|
|
@ -105,7 +127,7 @@ Terragrunt-based homelab managing a Kubernetes cluster (5 nodes, v1.34.2) on Pro
|
|||
- **NFS exports**: Create dir on Proxmox host (`ssh root@192.168.1.127 "mkdir -p /srv/nfs/<service>"`), add to `/etc/exports`, run `exportfs -ra`.
|
||||
|
||||
## Automated Service Upgrades
|
||||
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → SSH → `claude -p` (upgrade agent)
|
||||
- **Pipeline**: DIUN (detect) → n8n webhook (filter + rate limit) → HTTP POST → `claude-agent-service` (K8s) → `claude -p` (upgrade agent)
|
||||
- **Agent**: `.claude/agents/service-upgrade.md` — analyzes changelogs, backs up DBs, bumps versions, verifies health, rolls back on failure
|
||||
- **Config**: `.claude/reference/upgrade-config.json` — GitHub repo mappings, DB-backed services, skip patterns
|
||||
- **Rate limit**: Max 5 upgrades per 6h DIUN scan cycle (configured in n8n workflow)
|
||||
|
|
|
|||
BIN
config.tfvars
Binary file not shown.
|
|
@ -16,10 +16,10 @@ n8n Webhook (POST /webhook/<uuid>)
|
|||
│ rate limit: max 5 upgrades per 6h window
|
||||
│
|
||||
▼
|
||||
SSH → Dev VM (10.0.10.10)
|
||||
HTTP POST → claude-agent-service (K8s)
|
||||
│
|
||||
▼
|
||||
claude -p "upgrade agent prompt"
|
||||
claude -p "upgrade agent prompt" (in-cluster)
|
||||
│
|
||||
▼
|
||||
Service Upgrade Agent
|
||||
|
|
@ -54,7 +54,7 @@ Service Upgrade Agent
|
|||
- Only `status=update` (skip `new`, `unchanged`)
|
||||
- Skip databases, custom images, infra images, `:latest`
|
||||
- **Rate limiting**: Max 5 upgrades per 6-hour window using `$getWorkflowStaticData('global')`
|
||||
- **Action**: SSH to dev VM, runs `claude -p` with the upgrade agent prompt
|
||||
- **Action**: HTTP POST to `claude-agent-service.claude-agent.svc:8080/execute` with the upgrade agent prompt
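
The rate limit of 5 upgrades per 6-hour window can be illustrated with a small state-based check. This is a sketch, not the n8n workflow code: the real workflow is JavaScript and keeps its state in `$getWorkflowStaticData('global')`, while here `state` is an ordinary dict, and a sliding window is assumed because the document does not say whether the window is sliding or fixed.

```python
WINDOW_SECONDS = 6 * 60 * 60
MAX_UPGRADES = 5


def allow_upgrade(state: dict, now: float) -> bool:
    """Allow and record an upgrade only if fewer than MAX_UPGRADES ran in the window."""
    recent = [t for t in state.get("timestamps", []) if now - t < WINDOW_SECONDS]
    allowed = len(recent) < MAX_UPGRADES
    if allowed:
        recent.append(now)
    state["timestamps"] = recent  # persist the pruned list for the next call
    return allowed
```

Pruning old timestamps on every call keeps the stored state bounded at `MAX_UPGRADES` entries.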
|
||||
|
||||
### Upgrade Agent
|
||||
- **Prompt**: `.claude/agents/service-upgrade.md`
|
||||
|
|
@ -173,7 +173,35 @@ Key behaviors observed:
|
|||
| Secret | Vault Path | Purpose |
|
||||
|--------|-----------|---------|
|
||||
| n8n webhook URL | `secret/diun` → `n8n_webhook_url` | DIUN → n8n trigger |
|
||||
| Agent API bearer token | `secret/claude-agent-service` → `api_bearer_token` | n8n → claude-agent-service `/execute` auth. Synced into both `claude-agent` ns (consumer) and `n8n` ns (caller) via ESO. n8n exposes it to the container as `CLAUDE_AGENT_API_TOKEN` env var. |
|
||||
| Claude OAuth (primary) | `secret/claude-agent-service` → `claude_oauth_token` | Long-lived 1-year token from `claude setup-token`. Consumed by the CLI via `CLAUDE_CODE_OAUTH_TOKEN` env var (set on the container via `envFrom`). Preferred over the short-lived `.credentials.json` — CLI skips the refresh dance entirely. Rotate yearly; alert fires 30d out. |
|
||||
| Claude OAuth (spares) | `secret/claude-agent-service-spare-{1,2}` → `claude_oauth_token` | Failover tokens. Minted alongside the primary (verified that Anthropic does NOT revoke earlier sessions on a new mint). Swap one into the primary path if the primary is revoked or compromised. |
|
||||
| GitHub PAT | `secret/viktor` → `github_pat` | Changelog fetch (5000 req/hr) |
|
||||
| Slack webhook | `secret/platform` → `alertmanager_slack_api_url` | Upgrade notifications |
|
||||
| Woodpecker token | `secret/viktor` → `woodpecker_token` | CI pipeline polling |
|
||||
| Dev VM SSH key | n8n credentials store → `devvm-ssh` | n8n → dev VM SSH |
|
||||
|
||||
## OAuth token lifecycle
|
||||
|
||||
The CLI supports two auth modes. We use the second — long-lived.
|
||||
|
||||
| Mode | How minted | TTL | Needs refresh? | When to use |
|
||||
|------|-----------|-----|----------------|-------------|
|
||||
| `claude login` → `.credentials.json` | Interactive browser OAuth | Access ~6h + refresh token | Yes — CLI auto-refreshes on startup if refresh token valid | Human dev machines |
|
||||
| `claude setup-token` → opaque `sk-ant-oat01-*` | Interactive browser OAuth | **1 year** | No — expires hard | **Headless / service accounts (us)** |
|
||||
|
||||
When both are present on disk, `CLAUDE_CODE_OAUTH_TOKEN` env var wins.
|
||||
|
||||
**Harvesting headless**: `setup-token` uses Ink (React for terminals) and needs a real PTY with **≥300-column width**. At 80 columns, Ink wraps the token and DROPS one character at the wrap boundary, yielding an invalid 107-character token instead of the valid 108-character one. Python wrapper pattern documented in memory; we harvested 2 spare tokens into Vault on 2026-04-18 using a temporary harvester pod.
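
A minimal sketch of running a command inside a wide PTY, assuming a Linux host. It is not the wrapper referenced above: the function name `run_in_wide_pty` is hypothetical, and the demo child simply reports the terminal width it sees rather than invoking `claude setup-token`.

```python
import fcntl
import os
import pty
import struct
import subprocess
import sys
import termios


def run_in_wide_pty(argv, cols=300, rows=50):
    """Run argv attached to a PTY of the given size and return its output."""
    master, slave = pty.openpty()
    # struct winsize: rows, cols, x-pixels, y-pixels
    fcntl.ioctl(slave, termios.TIOCSWINSZ, struct.pack("HHHH", rows, cols, 0, 0))
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave, stderr=slave)
    os.close(slave)  # the parent keeps only the master end
    chunks = []
    while True:
        try:
            data = os.read(master, 4096)
        except OSError:  # Linux raises EIO once the child closes the slave
            break
        if not data:
            break
        chunks.append(data)
    proc.wait()
    os.close(master)
    return b"".join(chunks).decode(errors="replace")


if __name__ == "__main__":
    child = [sys.executable, "-c", "import os; print(os.get_terminal_size().columns)"]
    print(run_in_wide_pty(child).strip())
```

Setting the window size on the PTY before the child starts matters because a terminal UI typically reads the width once at startup and wraps its output accordingly.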
|
||||
|
||||
**Monitoring**: CronJob `claude-oauth-expiry-monitor` (claude-agent ns, every 6h) pushes `claude_oauth_token_expiry_timestamp{path="..."}` to Pushgateway. Alerts: `ClaudeOAuthTokenExpiringSoon` (30d, warn), `ClaudeOAuthTokenCritical` (7d, crit), `ClaudeOAuthTokenMonitorStale` (48h no push, warn), `ClaudeOAuthTokenMonitorNeverRun` (metric absent, warn).
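
The alert thresholds above can be illustrated with a small sketch. The real alerts are Prometheus rules over the pushed metric; the helper names below are hypothetical, and treating the expiry as exactly mint time plus 365 days is an assumption based on the stated 1-year TTL.

```python
DAY_SECONDS = 24 * 60 * 60
ONE_YEAR_SECONDS = 365 * DAY_SECONDS  # assumed TTL of a `claude setup-token` token


def expiry_timestamp(mint_epoch: int) -> int:
    """Expiry time derived from the recorded mint epoch (assumption: mint + 365 days)."""
    return mint_epoch + ONE_YEAR_SECONDS


def classify(expiry_epoch: int, now_epoch: int) -> str:
    """Map time-to-expiry onto the 30-day warning and 7-day critical thresholds."""
    days_left = (expiry_epoch - now_epoch) / DAY_SECONDS
    if days_left <= 7:
        return "critical"
    if days_left <= 30:
        return "warning"
    return "ok"


def monitor_is_stale(last_push_epoch: int, now_epoch: int) -> bool:
    """True when the monitor has not pushed for more than 48 hours."""
    return now_epoch - last_push_epoch > 48 * 60 * 60
```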
|
||||
|
||||
**Rotation**: on alert, harvest a new token, `vault kv patch secret/claude-agent-service claude_oauth_token=<new>`, update the `claude_oauth_token_mint_epochs` local in `stacks/claude-agent-service/main.tf`, `scripts/tg apply` → alert clears on next cron tick.
|
||||
|
||||
## n8n workflow gotchas
|
||||
|
||||
The `DIUN Upgrade Agent` workflow is imported once into n8n's PG DB — it is **not** Terraform-managed. The JSON at `stacks/n8n/workflows/diun-upgrade.json` is a backup; the live state lives in `workflow_entity.nodes`. Drift between the two is possible.
|
||||
|
||||
- **HTTP Request node header expressions must use template-literal form**: `=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}` works; `='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN` does NOT evaluate and sends an empty/bogus header → 401 from claude-agent-service.
|
||||
- **`N8N_BLOCK_ENV_ACCESS_IN_NODE=false`** must be set on the n8n deployment for expressions to read `$env.*` at all.
|
||||
- **Troubleshooting 401**: the workflow will show `success` status on the webhook node but error on `Run Upgrade Agent`. Inspect in n8n UI → Executions, or query `execution_entity` + `execution_data` directly. Claude-agent-service logs will also show `POST /execute HTTP/1.1 401 Unauthorized`.
|
||||
- **Patching the live workflow** (one-off, since it's not in TF): `UPDATE workflow_entity SET nodes = REPLACE(nodes::text, OLD, NEW)::json WHERE name = 'DIUN Upgrade Agent';`
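
Because the backup JSON and the live workflow can drift, one option is to lint an exported workflow for the concatenation-form expression described in the first gotcha. The sketch below walks any JSON document generically; the sample structure is hypothetical and only illustrates the shape of a header value, and `find_concat_expressions` is an illustrative name.

```python
import json
import re

# Matches the concatenation form reported above as failing,
# e.g. "='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN".
CONCAT_ENV = re.compile(r"^=.*\+\s*\$env\.")


def find_concat_expressions(node, path="$"):
    """Yield (path, value) for every string expression that concatenates $env with '+'."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_concat_expressions(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_concat_expressions(value, f"{path}[{index}]")
    elif isinstance(node, str) and CONCAT_ENV.match(node):
        yield path, node


# Hypothetical sample; a real export may nest header values differently.
sample = json.loads("""
{"nodes": [{"name": "Run Upgrade Agent",
            "parameters": {"headers": [
                {"name": "Authorization", "value": "='Bearer ' + $env.CLAUDE_AGENT_API_TOKEN"},
                {"name": "X-Ok", "value": "=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}"}]}}]}
""")

for hit_path, hit_value in find_concat_expressions(sample):
    print(f"{hit_path}: {hit_value}")
```

Running the same function over both the repository backup and a fresh export from n8n would flag the broken form wherever it appears, without relying on a particular node layout.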
|
||||
|
|
|
|||
|
|
@ -178,11 +178,11 @@ flowchart LR
|
|||
subgraph "Kubernetes Cluster"
|
||||
C -->|Yes| D[Woodpecker Pipeline]
|
||||
D --> E[Vault Auth<br/>K8s SA JWT]
|
||||
E --> F[Fetch SSH Key]
|
||||
E --> F[Fetch API Token]
|
||||
end
|
||||
|
||||
subgraph "DevVM (10.0.10.10)"
|
||||
F --> G[SSH + Claude Code]
|
||||
subgraph "claude-agent-service (K8s)"
|
||||
F --> G[HTTP POST /execute]
|
||||
G --> H[issue-responder agent]
|
||||
H --> I[Investigate / Implement]
|
||||
I --> J[Comment on Issue]
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
## Overview
|
||||
|
||||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 5-layer anti-AI scraping defense. All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||||
The homelab implements defense-in-depth security at the application layer (L7) using CrowdSec for threat intelligence and IP reputation, Kyverno for policy enforcement and resource governance, and a 3-layer anti-AI scraping defense (reduced from 5 in April 2026 after removing the rewrite-body plugin). All security components operate in graceful degradation mode (fail-open) to prevent cascading failures. Security policies are deployed in audit mode first, then selectively enforced after validation.
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
|
|
@ -59,7 +59,7 @@ Every incoming request passes through 6 security layers:
|
|||
1. **Cloudflare WAF** - DDoS protection, bot detection, firewall rules (external)
|
||||
2. **Cloudflared Tunnel** - Zero Trust tunnel, hides origin IP
|
||||
3. **CrowdSec Bouncer** - IP reputation check against LAPI (fail-open on error)
|
||||
4. **Anti-AI Scraping** - 5-layer bot defense (optional per service)
|
||||
4. **Anti-AI Scraping** - 3-layer bot defense (optional per service, updated 2026-04-17)
|
||||
5. **Authentik ForwardAuth** - Authentication check (if `protected = true`)
|
||||
6. **Rate Limiting** - Per-source IP rate limits (returns 429 on breach)
|
||||
7. **Retry Middleware** - Auto-retry on transient errors (2 attempts, 100ms delay)
|
||||
|
|
@ -131,10 +131,12 @@ This prevents resource exhaustion and enforces governance without manual quota m
|
|||
| `sync-tier-label` | Propagate tier label to child resources | Enforce |
|
||||
| `goldilocks-vpa-auto-mode` | Disable VPA globally (VPA off) | Enforce |
|
||||
|
||||
### Anti-AI Scraping (5-Layer Defense)
|
||||
### Anti-AI Scraping (3 Active Layers) (Updated 2026-04-17)
|
||||
|
||||
Enabled by default via `ingress_factory` module. Disable per-service with `anti_ai_scraping = false`.
|
||||
|
||||
Active middleware chain: `ai-bot-block` (ForwardAuth) + `anti-ai-headers` (X-Robots-Tag). The `strip-accept-encoding` and `anti-ai-trap-links` middlewares were removed in April 2026 due to Traefik v3.6.12 Yaegi plugin incompatibility with the rewrite-body plugin.
|
||||
|
||||
#### Layer 1: Bot Blocking (ForwardAuth)
|
||||
|
||||
- Middleware calls `poison-fountain` service before backend
|
||||
|
|
@ -148,25 +150,16 @@ Enabled by default via `ingress_factory` module. Disable per-service with `anti_
|
|||
- Instructs compliant bots to skip content
|
||||
- Lightweight, no performance impact
|
||||
|
||||
#### Layer 3: Trap Links
|
||||
#### ~~Layer 3: Trap Links~~ (REMOVED)
|
||||
|
||||
- JavaScript injects invisible links before `</body>`
|
||||
- Links point to honeypot endpoints
|
||||
- Legitimate browsers don't click, bots follow
|
||||
- Triggered bots get added to ban list
|
||||
Removed April 2026. The rewrite-body Traefik plugin used to inject hidden trap links broke on Traefik v3.6.12 due to Yaegi runtime bugs. The companion `strip-accept-encoding` middleware was also removed.
|
||||
|
||||
#### Layer 4: Tarpit
|
||||
|
||||
- Serves AI bots extremely slowly (~100 bytes/sec)
|
||||
- Wastes bot resources, makes scraping uneconomical
|
||||
- Humans see normal speed (only applies to detected bots)
|
||||
|
||||
#### Layer 5: Poison Content
|
||||
#### Layer 3 (formerly 4): Tarpit / Poison Content
|
||||
|
||||
- `poison-fountain` service still exists as a standalone service at `poison.viktorbarzin.me`
|
||||
- Serves AI bots extremely slowly (~100 bytes/sec tarpit)
|
||||
- CronJob every 6 hours generates fake content
|
||||
- Injects misleading/nonsense data into pages shown to bots
|
||||
- Degrades AI training data quality
|
||||
- **Requires `--http1.1` flag** to work with current HTTP/2 setup
|
||||
- Trap links are no longer injected into real pages, but bots that discover `poison.viktorbarzin.me` directly still get tarpitted and poisoned
|
||||
|
||||
**Implementation**: See `stacks/poison-fountain/` and `stacks/platform/modules/traefik/middleware.tf`
|
||||
|
||||
|
|
@ -286,13 +279,13 @@ spec:
|
|||
- **Better observability**: Collect violation metrics before enforcing
|
||||
- **Selective enforcement**: Move to enforce mode per-policy after validation
|
||||
|
||||
### Why 5-Layer Anti-AI Defense?
|
||||
### Why Multi-Layer Anti-AI Defense? (Updated 2026-04-17)
|
||||
|
||||
- **Defense in depth**: Each layer catches different bot types
|
||||
- **Compliant bots**: Layer 2 (X-Robots-Tag) handles respectful crawlers
|
||||
- **Dumb bots**: Layer 3 (trap links) catches simple scrapers
|
||||
- **Persistent bots**: Layer 4 (tarpit) makes scraping uneconomical
|
||||
- **Sophisticated bots**: Layer 5 (poison content) degrades training data
|
||||
- **Persistent bots**: Tarpit makes scraping uneconomical
|
||||
- **Poison content**: Degrades training data for bots that reach poison-fountain
|
||||
- Layer 3 (trap links via rewrite-body) was removed due to Traefik v3 plugin incompatibility
|
||||
|
||||
### Why Fail-Open Mode?
|
||||
|
||||
|
|
@ -382,15 +375,16 @@ spec:
|
|||
2. Verify backend isn't returning transient errors: Check for 5xx responses
|
||||
3. Disable retry for specific service: Remove retry middleware from `ingress_factory`
|
||||
|
||||
### Poison Content Not Injecting
|
||||
### Poison Content Not Serving (Updated 2026-04-17)
|
||||
|
||||
**Problem**: Bots not receiving poisoned content.
|
||||
**Problem**: Bots not receiving poisoned content on `poison.viktorbarzin.me`.
|
||||
|
||||
**Note**: Poison content is no longer injected into real pages (rewrite-body removed). It is only served directly via the `poison.viktorbarzin.me` subdomain.
|
||||
|
||||
**Fix**:
|
||||
1. Verify CronJob running: `kubectl get cronjob -n poison-fountain`
|
||||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-content-injector`
|
||||
3. Ensure `--http1.1` flag set (required for HTTP/2 backends)
|
||||
4. Manually trigger: `kubectl create job --from=cronjob/poison-content manual-poison`
|
||||
2. Check logs: `kubectl logs -n poison-fountain -l app=poison-fountain`
|
||||
3. Manually trigger: `kubectl create job -n poison-fountain --from=cronjob/poison-content manual-poison`
|
||||
|
||||
## Related
|
||||
|
||||
|
|
|
|||
|
|
@ -1,5 +1,7 @@
|
|||
# Anti-AI Scraping System Design
|
||||
|
||||
> **Status (Updated 2026-04-17):** Partially superseded. Layer 3 (trap links via rewrite-body plugin) removed due to Traefik v3.6.12 Yaegi plugin incompatibility. The `strip-accept-encoding` and `anti-ai-trap-links` middlewares have been deleted. Rybbit analytics injection moved from Traefik rewrite-body to a Cloudflare Worker (`infra/stacks/rybbit/worker/`). Active layers: 1 (bot-block), 2 (headers), 4 (tarpit), 5 (poison content).
|
||||
|
||||
## Problem
|
||||
|
||||
AI scrapers crawl public web services to harvest training data. We want to:
|
||||
|
|
@ -9,7 +11,7 @@ AI scrapers crawl public web services to harvest training data. We want to:
|
|||
|
||||
## Architecture
|
||||
|
||||
Five defense layers applied to all public services via Traefik:
|
||||
Four active defense layers applied to all public services via Traefik (Layer 3 removed April 2026):
|
||||
|
||||
```
|
||||
Internet -> Cloudflare -> Traefik
|
||||
|
|
@ -18,7 +20,7 @@ Internet -> Cloudflare -> Traefik
|
|||
|
|
||||
+-- Layer 2: Headers -> X-Robots-Tag: noai, noimageai
|
||||
|
|
||||
+-- Layer 3: Rewrite-body -> inject hidden trap links into HTML
|
||||
+-- [REMOVED] Layer 3: Rewrite-body trap links (April 2026 — Yaegi bugs in Traefik v3.6.12)
|
||||
|
|
||||
+-- Layer 4: Poison service -> serve cached Poison Fountain data
|
||||
|
|
||||
|
|
@ -68,13 +70,10 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
|
|||
- Sets `X-Robots-Tag: noai, noimageai` on all responses
|
||||
- Added to all public services via ingress_factory
|
||||
|
||||
**`anti-ai-trap-links` (rewrite-body plugin)**:
|
||||
- Regex: `</body>` -> injects hidden div with trap links + `</body>`
|
||||
- Links point to `https://poison.viktorbarzin.me/article/<slug>`
|
||||
- CSS: invisible to humans (`position:absolute;left:-9999px;height:0;overflow:hidden;aria-hidden=true`)
|
||||
- Only processes `text/html` responses
|
||||
- Requires strip-accept-encoding companion middleware (already exists)
|
||||
- Applied globally via ingress_factory
|
||||
**`anti-ai-trap-links` (rewrite-body plugin)** — REMOVED (Updated 2026-04-17):
|
||||
- Removed due to Traefik v3.6.12 Yaegi runtime bugs making the rewrite-body plugin unreliable
|
||||
- The companion `strip-accept-encoding` middleware was also removed (only existed for rewrite-body)
|
||||
- Trap link injection is no longer active; poison-fountain still serves tarpit content standalone
|
||||
|
||||
### 4. Trap subdomain: poison.viktorbarzin.me
|
||||
|
||||
|
|
@ -88,7 +87,7 @@ All defined in `stacks/platform/modules/traefik/middleware.tf`:
|
|||
|
||||
New variables:
|
||||
- `anti_ai_scraping` (bool, default: true) - enable all anti-AI layers
|
||||
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`, `strip-accept-encoding`, `anti-ai-trap-links`
|
||||
- When true, adds to middleware chain: `ai-bot-block`, `anti-ai-headers`
|
||||
- Services can opt out with `anti_ai_scraping = false`
|
||||
|
||||
## Human User Protection
|
||||
|
|
@ -97,7 +96,7 @@ New variables:
|
|||
|---------|-----------|
|
||||
| Hidden links visible | CSS `position:absolute;left:-9999px;height:0;overflow:hidden` + `aria-hidden="true"` |
|
||||
| False positive blocking | Only blocks specific AI bot User-Agent strings; no browser matches these |
|
||||
| Performance overhead | ForwardAuth is a string match (<1ms). Rewrite-body already proven with Rybbit. |
|
||||
| Performance overhead | ForwardAuth is a string match (<1ms). Rybbit injected via Cloudflare Worker (not Traefik). |
|
||||
| Poison content leakage | Only served on poison.viktorbarzin.me, not linked from any navigation |
|
||||
| Slow responses | Tarpit only applies to poison.viktorbarzin.me, not to real services |
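
The "string match" claim in the table can be pictured with a short sketch of a ForwardAuth-style decision. Traefik's ForwardAuth lets a request through when the auth service answers 2xx and otherwise returns that service's response to the client. The marker list and the 403 status below are illustrative assumptions; the actual blocklist and response code belong to the `poison-fountain` service and are not shown in this document.

```python
# Illustrative markers only; the real blocklist lives in the poison-fountain service.
AI_BOT_MARKERS = ("gptbot", "ccbot", "claudebot")


def forward_auth_status(user_agent: str) -> int:
    """Return the status a ForwardAuth endpoint might send for this User-Agent."""
    ua = user_agent.lower()
    # A plain substring scan over a short tuple, hence the sub-millisecond cost.
    if any(marker in ua for marker in AI_BOT_MARKERS):
        return 403  # non-2xx: Traefik stops here and relays this response
    return 200      # 2xx: Traefik forwards the request to the backend
```

Browser User-Agent strings do not contain these markers, which is why the false-positive risk in the table is considered low.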
|
||||
|
||||
|
|
|
|||
150
docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md
Normal file
|
|
@ -0,0 +1,150 @@
|
|||
# Post-Mortem: Authentik Embedded Outpost `/dev/shm` Fills — Cluster-Wide Auth Blocked
|
||||
|
||||
| Field | Value |
|
||||
|-------|-------|
|
||||
| **Date** | 2026-04-18 |
|
||||
| **Duration** | ~44h for first-affected user (Emil, Apr 16 17:00 → Apr 18 12:40 UTC); ~30min for cluster-wide impact (Apr 18 12:10 → 12:40 UTC) |
|
||||
| **Severity** | SEV2 — authentication blocked for all users on all Authentik-protected services |
|
||||
| **Affected Services** | ~30+ Authentik-protected subdomains (every service using the `authentik-forward-auth` Traefik middleware) |
|
||||
| **Status** | Root cause fixed; permanent mitigation applied; outpost memory alerts added; Uptime Kuma meta-alert and runbook still TODO |
|
||||
|
||||
## Summary
|
||||
|
||||
The `ak-outpost-authentik-embedded-outpost` pod's `/dev/shm` (default 64 MB tmpfs) filled to 100% with ~44,000 `session_*` files. Once full, every forward-auth request failed to write its session state with `ENOSPC` and the outpost returned HTTP 400 instead of the usual 302 → login redirect. All users on all protected services were unable to log in.
|
||||
|
||||
Detection was delayed because the initial user report (Emil) looked like a per-user bug — investigation spent two days chasing hypotheses about non-ASCII headers, user privileges, cookie corruption, and a newly-deployed Cloudflare Worker before the real cause was found in the outpost logs.
|
||||
|
||||
## Impact
|
||||
|
||||
- **User-facing**: HTTP 400 on initial GET of any Authentik-protected site (`terminal`, `grafana`, `immich`, `proxmox`, `london`, etc.). Existing sessions whose cookies were still cached worked until their cookie rotation attempt, then broke.
|
||||
- **Blast radius**: Every service using the `authentik-forward-auth` middleware via the "Domain wide catch all" Proxy provider. Public and internal.
|
||||
- **Duration**: First user (Emil) broken since 2026-04-16 ~17:00 UTC after his last valid session. Cluster-wide block when Viktor's cached session stopped being sufficient — roughly 2026-04-18 12:10 UTC. Fixed 12:40 UTC.
|
||||
- **Data loss**: None. Session state in tmpfs is ephemeral by design.
|
||||
- **Monitoring gap**: No Prometheus alert on outpost `/dev/shm` usage. No alert on outpost 400 response rate. Uptime Kuma external monitors hitting protected services returned 400s for 40+ hours without paging.
|
||||
|
||||
## Timeline (UTC)
|
||||
|
||||
| Time | Event |
|
||||
|------|-------|
|
||||
| **Apr 15 ~09:21** | `ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` pod started (normal rolling restart, unrelated to this incident). `/dev/shm` fresh. |
|
||||
| **Apr 16 16:23:32** | Emil's last successful `authorize_application` event from his iPhone Brave (`85.255.235.23`). After this point, his subsequent requests create session files — his new sessions work briefly, then `/dev/shm` fills and every new session write fails. |
|
||||
| **Apr 16 ~17:00 (approx)** | `/dev/shm` at ~44,000 files = 100% full. New forward-auth requests start returning 400 across the board. Viktor's browser still has a valid cached cookie so his requests succeed without writing new session files. |
|
||||
| **Apr 17 10:30 (approx)** | Emil reports "terminal.viktorbarzin.me returns 400" to Viktor. |
|
||||
| **Apr 18 09:00–12:30** | Deep investigation begins. Multiple hypotheses tested and rejected: non-ASCII bytes in Emil's `name` field, policy denial, cookie corruption, Rybbit Cloudflare Worker (deployed 2026-04-17 — suspicious timing, turned out unrelated), plaintext redirect scheme. |
|
||||
| **Apr 18 12:20:39** | First direct evidence found: 2 Chrome 400s in Traefik logs from Emil's IP `176.12.22.76` (BG) on `terminal.viktorbarzin.me`, request missing `authentik_proxy_*` cookie. Redirect loop observed on iPhone IPv6 `2620:10d:c092:500::7:8c0d`. |
|
||||
| **Apr 18 12:34** | Viktor reports he can no longer log in either. |
|
||||
| **Apr 18 12:38** | `curl` against direct Traefik (`--resolve` bypassing Cloudflare) returns the same 400 with Authentik's CSP header — Cloudflare Worker exonerated. |
|
||||
| **Apr 18 12:39** | Outpost log grep finds the smoking gun: `failed to save session: write /dev/shm/session_XXX: no space left on device`. |
|
||||
| **Apr 18 12:40:13** | `kubectl delete pod ak-outpost-authentik-embedded-outpost-587598dc4b-fvzzz` — tmpfs cleared on pod restart. Replacement pod `-8qscr` Running within 8s. Cluster unblocked. |
|
||||
| **Apr 18 12:41** | Verified: direct-Traefik and via-CF curls both return `HTTP 302` to Authentik auth flow. Viktor authenticates successfully on `proxmox.viktorbarzin.me`. |
|
||||
| **Apr 18 12:53** | Permanent fix applied via Authentik API: `PATCH /api/v3/outposts/instances/{uuid}/` setting `config.kubernetes_json_patches` to mount `emptyDir {medium: Memory, sizeLimit: 512Mi}` at `/dev/shm`. |
|
||||
| **Apr 18 12:54** | Authentik controller reconciled the Deployment within 5s. `kubectl rollout restart` triggered new pod `-k5hv8`. `/dev/shm` now `tmpfs 256M` (4× the previous capacity; K8s clamps the tmpfs size to pod memory policy, but usage is capped at `sizeLimit=512Mi`). Forward-auth verified working. |
|
||||
|
||||
## Root Cause Chain
|
||||
|
||||
```
|
||||
[1] goauthentik/proxy outpost uses gorilla/sessions FileStore
|
||||
└─> each forward-auth request that has no valid session cookie writes
|
||||
/dev/shm/session_<random> (~1500 bytes/file)
|
||||
│
|
||||
├─> [2] Catch-all Proxy provider's access_token_validity = hours=168 (7 days)
|
||||
│ └─> each file's MaxAge = 7 days
|
||||
│ └─> Upstream 5-min GC (PR #15798, shipped in ≥ 2025.10) can only
|
||||
│ delete files whose MaxAge has EXPIRED, not whose age exceeds any
|
||||
│ shorter threshold
|
||||
│
|
||||
├─> [3] Measured creation rate: ~18 files/min (Uptime-Kuma monitors +
|
||||
│ real user traffic)
|
||||
│ └─> 18/min × 60 × 24 × 7 = 181,440 steady-state files expected
|
||||
│
|
||||
└─> [4] Pod's /dev/shm default: 64 MB tmpfs (Kubernetes default)
|
||||
└─> 64 MB / 1500 bytes ≈ 44,000 files maximum
|
||||
└─> Full in approx 44,000 files / (18 files/min × 60 min/h) ≈ 41 hours
|
||||
└─> Actual observed time: pod started Apr 15 ~09:21,
|
||||
first ENOSPC ~Apr 16 ~17:00 ≈ 32 hours
|
||||
(some excess from Uptime-Kuma bursts)
|
||||
|
||||
[ENOSPC] -> every new forward-auth request fails -> outpost returns HTTP 400
|
||||
-> Traefik forwards the 400 to the browser
|
||||
-> user sees "400 Bad Request" on every protected site
|
||||
```
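
The arithmetic in the chain above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch using the figures quoted in the chain (64 MB tmpfs, ~1500 bytes per file, ~18 files/min); it ignores per-file filesystem overhead, and the function names are illustrative.

```python
SHM_BYTES = 64 * 1024 * 1024  # default /dev/shm tmpfs size
FILE_BYTES = 1500             # approximate size of one session_* file
FILES_PER_MIN = 18            # measured session-file creation rate


def max_files(shm_bytes: int, file_bytes: int) -> int:
    """How many session files fit before writes fail with ENOSPC."""
    return shm_bytes // file_bytes


def hours_to_fill(shm_bytes: int, file_bytes: int, files_per_min: int) -> float:
    """Time until the tmpfs is full when nothing is garbage-collected."""
    return max_files(shm_bytes, file_bytes) / (files_per_min * 60)


def steady_state_files(files_per_min: int, validity_hours: int) -> int:
    """Files alive at once when each one survives for validity_hours."""
    return files_per_min * 60 * validity_hours


print(max_files(SHM_BYTES, FILE_BYTES))
print(round(hours_to_fill(SHM_BYTES, FILE_BYTES, FILES_PER_MIN), 1))
print(steady_state_files(FILES_PER_MIN, 168), steady_state_files(FILES_PER_MIN, 24))
```

The steady-state count at 7-day validity is roughly four times what the default tmpfs can hold, so the volume fills long before the GC has anything eligible to delete.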
|
||||
|
||||
## Why Diagnosis Took So Long
|
||||
|
||||
The initial report was framed as "Emil can't access terminal", a per-user symptom. All four pre-registered hypotheses in the triage plan (non-ASCII bytes in header value, oversized cookie, corrupt user attribute, provider policy rejecting groups) were per-user explanations, and all four were falsified.
|
||||
|
||||
Contributing distractions:
|
||||
1. **Misattribution in initial research** — an `authorize_application` event for Viktor (`vbarzin@gmail.com`) at 2026-04-18 08:09 was initially attributed to Emil. This led to the incorrect conclusion that Emil was authenticating successfully today.
|
||||
2. **Rybbit analytics Cloudflare Worker deployed 2026-04-17** (see memory #792, commit around 2026-04-17 21:26 UTC) ran on `*.viktorbarzin.me/*`. Suspicious timing — Viktor's first instinct was "this must be the Worker." The Worker WAS adding long cookies to browser state, but not the cause of the 400. Exonerated by direct-Traefik curl returning the same 400.
|
||||
3. **Viktor's cached session masked the outage** — only unauthenticated requests wrote new session files. Viktor's valid cookie kept working until the outpost needed to rotate state, at which point he also hit 400.
|
||||
4. **The tell is in the outpost logs, not anywhere else.** `grep 'no space left on device'` on the outpost logs would have found it in seconds, but the investigation scope started with user records, then cookies, then the Worker — outpost logs weren't grepped until hour 3+.
|
||||
|
||||
## Contributing Factors
|
||||
|
||||
1. **No alert on outpost `/dev/shm` usage.** A simple `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.8` or equivalent cAdvisor metric would have paged hours before users noticed.
|
||||
2. **No alert on outpost HTTP 400 rate.** `increase(authentik_outpost_http_requests_total{status="400"}[15m])` went from ~0 to thousands — invisible to our monitoring.
|
||||
3. **No alert on "Uptime-Kuma external monitors all turning red simultaneously."** Every external monitor for a protected service started failing, but each is individually monitored — correlated failures across dozens of services didn't trigger a higher-level alert.
|
||||
4. **Default Kubernetes `/dev/shm` is 64 MB.** This is fine for most workloads, but the goauthentik proxy outpost writes one session file per unauthenticated request with a 7-day retention. The default sizing is an accident waiting to happen on any busy deployment.
|
||||
5. **Upstream issue [#20093](https://github.com/goauthentik/authentik/issues/20093)** ("External Proxy Outpost cannot use persistent session backend") is still OPEN as of 2026-04-18. Known architectural limitation.
|
||||
6. **Catch-all Proxy provider is UI-managed, not Terraform-managed.** Its `access_token_validity` and the outpost's `kubernetes_json_patches` are configured in Authentik's PostgreSQL database, not in code. This means the fix applied today is invisible to `git log` and vulnerable to drift if someone changes it in the UI.
|
||||
|
||||
## Detection Gaps
|
||||
|
||||
| Gap | Impact | Fix |
|
||||
|-----|--------|-----|
|
||||
| No alert on outpost `/dev/shm` usage | Outage progressed from "Emil only" to "everyone" over 40+ hours silently | Add Prometheus alert: `kubelet_volume_stats_used_bytes{namespace="authentik",persistentvolumeclaim=~"dshm.*"} / kubelet_volume_stats_capacity_bytes > 0.8` (or per-container cAdvisor metric if emptyDir not a PVC) |
|
||||
| No alert on outpost 400 rate spike | ~thousands of 400s over 40h didn't page | Alert on `increase(traefik_service_requests_total{code="400",service=~".*viktorbarzin-me.*"}[15m]) > N` OR on outpost-specific 400 metric |
|
||||
| Uptime Kuma external monitors not cross-correlated | Dozens of red monitors didn't trigger a cluster-wide alert | Add meta-alert: "more than N [External] Uptime Kuma monitors down within 10 min" — strong signal of shared-infra failure |
|
||||
| Outpost logs not searched during initial triage | Investigation went down 4 wrong paths before finding the real error | Runbook addition: for any Authentik forward-auth issue, FIRST command is `kubectl -n authentik logs -l goauthentik.io/outpost-name=authentik-embedded-outpost --since=1h \| grep -iE 'error\|no space'` |
|
||||
|
||||
## Prevention Plan
|
||||
|
||||
### P0 — Prevent this exact failure
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P0 | Size `/dev/shm` up via `kubernetes_json_patches` on the embedded outpost config | Config | `PATCH /api/v3/outposts/instances/0eecac07-97c7-443c-8925-05f2f4fe3e47/` with `config.kubernetes_json_patches.deployment` adding an `emptyDir {medium: Memory, sizeLimit: 512Mi}` volume at `/dev/shm`. Authentik reconciles the Deployment within 5 minutes. **Applied 2026-04-18 12:53 UTC.** | **DONE** |
|
||||
|
||||
### P1 — Detect this next time
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P1 | Prometheus alerts on outpost `/dev/shm` fill (two thresholds) | Alert | Group `Authentik Outpost` added in `stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl`. Container working-set memory serves as the proxy signal, because tmpfs writes count against the container's cgroup memory: `AuthentikOutpostMemoryHigh` (warning, working set > 1.5 GiB for 15m) + `AuthentikOutpostMemoryCritical` (critical, > 1.8 GiB for 5m) + `AuthentikOutpostRestarts` (warning, > 2 restarts in 30m). Applied 2026-04-18 13:16 UTC; loaded in Prometheus, state=inactive. | **DONE** |
|
||||
| P1 | Uptime-Kuma meta-monitor: "N+ external monitors down simultaneously" | Alert | Either a Prometheus rule over `uptime_kuma_monitor_status == 0` counts, or a dedicated external probe. Very strong signal of shared-infra failure. | TODO |
|
||||
| P1 | Bump tmpfs `sizeLimit` from 512Mi → 2Gi + set explicit container memory limit 2560Mi | Config | Patched outpost `kubernetes_json_patches` via Authentik API. 2026-04-18 13:06 UTC (sizeLimit), 13:22 UTC (container limit). **Gotcha**: `sizeLimit` alone is insufficient — writes to tmpfs count against container cgroup memory, and Kyverno's `tier-defaults` LimitRange sets a default `limits.memory: 256Mi` which OOM-kills the container before tmpfs fills. Fix is to also set `containers[0].resources.limits.memory` ≥ `sizeLimit + working_set_headroom`. Verified 1.5 GB file write succeeds on the configured pod; df reports 2.0 GB tmpfs. Gives ~8× growth headroom at current probe rate. | **DONE** |
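
The gotcha in the last row reduces to a simple inequality, sketched below. The 512 MiB headroom is an inference from the applied values (2560Mi limit minus 2Gi `sizeLimit`), not a figure stated in the table, and the function names are hypothetical.

```python
def min_memory_limit_mib(size_limit_mib: int, working_set_headroom_mib: int) -> int:
    """Smallest container memory limit that lets the tmpfs reach its sizeLimit."""
    return size_limit_mib + working_set_headroom_mib


def tmpfs_can_fill(memory_limit_mib: int, size_limit_mib: int, headroom_mib: int) -> bool:
    """False means the container is OOM-killed before the tmpfs sizeLimit is reached."""
    return memory_limit_mib >= min_memory_limit_mib(size_limit_mib, headroom_mib)


# LimitRange default (256Mi) against the first fix (512Mi sizeLimit): OOM comes first.
print(tmpfs_can_fill(256, 512, 512))
# Applied configuration: 2Gi sizeLimit with a 2560Mi container limit.
print(tmpfs_can_fill(2560, 2048, 512))
```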
|
||||
|
||||
### P2 — Codify the fix so it survives drift
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P2 | Codify the catch-all Proxy provider + embedded outpost config in Terraform | Architecture | Adopt `goauthentik/authentik` Terraform provider in `infra/stacks/authentik/`. Import the existing UUID `0eecac07-97c7-443c-8925-05f2f4fe3e47` and the catch-all provider pk=5. Move `kubernetes_json_patches` into TF so the fix is reviewable in git. | TODO |
|
||||
| P2 | Runbook: Authentik forward-auth troubleshooting | Docs | Add a runbook at `docs/runbooks/authentik-forward-auth-400.md` with the "grep outpost logs first" first step, plus pointer commands for `/dev/shm` usage, session file count, and recent authorize events. | TODO |
|
||||
|
||||
### P3 — Upstream + architectural
|
||||
|
||||
| Priority | Action | Type | Details | Status |
|
||||
|----------|--------|------|---------|--------|
|
||||
| P3 | Comment/support on authentik issue [#20093](https://github.com/goauthentik/authentik/issues/20093) | Upstream | Request either a persistent-backed session store (Redis/DB) OR a configurable GC interval shorter than the default 5 min. | TODO |
|
||||
| P3 | Consider shortening `access_token_validity` from 168h (7 days) to 24h | Config | Reduces steady-state session file count from ~181k to ~26k (7× reduction). Trade-off: users re-auth daily. Viktor's call on UX tolerance. | TODO |
|
||||
| P3 | Evaluate moving forward-auth away from the embedded outpost | Architecture | The embedded outpost is a single replica Go binary with in-memory session state. An external, multi-replica outpost with Redis-backed sessions is the production-grade deployment. Probably overkill for a home-lab, but worth noting. | TODO (paused) |
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **When a per-user bug affects a shared infrastructure layer, suspect the shared layer, not the user.** The framing "Emil gets 400" led the first two hours of investigation down four user-specific rabbit holes. A sanity check ("does ANY user's non-cached request to a protected site return 400?") would have cut to the chase in minutes.
|
||||
|
||||
2. **Check the outpost logs first, not last.** For any Authentik forward-auth oddity, the first `kubectl logs` should be on the outpost pod, grepping for `error` and `ENOSPC`. The outpost is the component that actually makes the 400/302 decision.
|
||||
|
||||
3. **Cache + low-request users mask outages longer than you'd think.** Viktor had a valid cookie and his browser kept using it without writing new session files; he couldn't reproduce the bug Emil saw. The outage felt per-user until his cookie rotation needed to write state. **Any outage that "only affects some users" needs an active check from a fresh, cookie-less context** — `curl` with no cookie jar is the fastest way.
|
||||
|
||||
4. **Default tmpfs sizing + per-request file writes = ticking clock.** 64 MB of `/dev/shm` is a Kubernetes default, not a considered choice. Any workload that writes per-request files into tmpfs without aggressive GC will eventually fill, and the time-to-fill scales inversely with request rate. Worth auditing other services that might have the same pattern.
|
||||
|
||||
5. **UI-managed Authentik config is invisible to git review.** Our catch-all Proxy provider, embedded outpost config, property mappings, and policy bindings are all in Authentik's PostgreSQL database. The fix applied today (`kubernetes_json_patches`) is durable but not discoverable from `git log`. Drift risk. Codify in Terraform.
|
||||
|
||||
6. **Recently-deployed things are prime suspects but not always guilty.** The Rybbit Cloudflare Worker was deployed 2026-04-17 with a wildcard route. Viktor's intuition was "that's the recent change, must be the cause." It was a plausible theory and worth checking — but `curl --resolve` to bypass Cloudflare proved it innocent within 30 seconds. Always have a way to bypass the suspect layer cheaply.
|
||||
|
||||
## References
|
||||
|
||||
- Memory #836-841: incident details stored in claude-memory MCP (2026-04-18 12:42 UTC).
|
||||
- Upstream issue: [goauthentik/authentik#20093](https://github.com/goauthentik/authentik/issues/20093) (open).
|
||||
- Related upstream fix: [PR #15798](https://github.com/goauthentik/authentik/pull/15798) — 5-min session GC shipped in ≥ 2025.10 (our version 2026.2.2 has it, but insufficient alone).
|
||||
- Beads task: `code-zru` (P1 bug).
|
||||
|
|
@ -39,26 +39,43 @@ if [ -z "$VAULT_TOKEN" ] || [ "$VAULT_TOKEN" = "null" ]; then
 fi
 echo "Vault authenticated"

-# 5. Fetch DevVM SSH key from Vault
-curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
-  http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/ci/infra | \
-  jq -r '.data.data.devvm_ssh_key' > /tmp/devvm-key
-chmod 600 /tmp/devvm-key
-if [ ! -s /tmp/devvm-key ]; then
-  echo "ERROR: Failed to fetch DevVM SSH key"
+# 5. Fetch API token for claude-agent-service
+AGENT_TOKEN=$(curl -sf -H "X-Vault-Token: $VAULT_TOKEN" \
+  http://vault-active.vault.svc.cluster.local:8200/v1/secret/data/claude-agent-service | \
+  jq -r '.data.data.api_bearer_token')
+if [ -z "$AGENT_TOKEN" ] || [ "$AGENT_TOKEN" = "null" ]; then
+  echo "ERROR: Failed to fetch agent API token"
   exit 1
 fi
-echo "SSH key fetched"
+echo "Agent token fetched"

-# 6. SSH to DevVM and run Claude Code headless
+# 6. Submit to claude-agent-service
 TODOS=$(cat /tmp/todos.json)
-ssh -i /tmp/devvm-key -o StrictHostKeyChecking=no wizard@10.0.10.10 \
-  "cd ~/code && git -C infra stash && git -C infra pull && git -C infra stash pop 2>/dev/null; ~/.local/bin/claude -p \
-  --agent infra/.claude/agents/postmortem-todo-resolver \
-  --dangerously-skip-permissions \
-  --max-budget-usd 5 \
-  'Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS'"
+PAYLOAD=$(jq -n \
+  --arg prompt "Implement the auto-implementable TODOs from $PM_FILE. Parsed TODO list: $TODOS" \
+  --arg agent ".claude/agents/postmortem-todo-resolver" \
+  '{prompt: $prompt, agent: $agent, max_budget_usd: 5, timeout_seconds: 900}')

-# 7. Cleanup
-rm -f /tmp/devvm-key
-echo "Pipeline complete"
+RESP=$(curl -sf -X POST \
+  -H "Authorization: Bearer $AGENT_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d "$PAYLOAD" \
+  http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute)
+JOB_ID=$(echo "$RESP" | jq -r '.job_id')
+echo "Job submitted: $JOB_ID"
+
+# 7. Poll for completion (15min max)
+for i in $(seq 1 60); do
+  sleep 15
+  RESULT=$(curl -sf \
+    -H "Authorization: Bearer $AGENT_TOKEN" \
+    http://claude-agent-service.claude-agent.svc.cluster.local:8080/jobs/$JOB_ID)
+  STATUS=$(echo "$RESULT" | jq -r '.status')
+  echo "[$i/60] Status: $STATUS"
+  if [ "$STATUS" != "running" ]; then
+    echo "$RESULT" | jq .
+    if [ "$STATUS" = "completed" ]; then exit 0; else exit 1; fi
+  fi
+done
+echo "ERROR: Job timed out after 15 minutes"
+exit 1

Binary file not shown.
@ -63,7 +63,7 @@ resource "kubernetes_deployment" "app" {
     }
   }
   lifecycle {
-    ignore_changes = [spec[0].template[0].spec[0].dns_config]
+    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
   }
 }

@ -72,7 +72,7 @@ module "tls_secret" {
 module "viktor" {
   source = "./factory"
   name = "viktor"
-  tag = "26.3.0"
+  tag = "26.4.0"
   tls_secret_name = var.tls_secret_name
   nfs_server = var.nfs_server
   depends_on = [kubernetes_namespace.actualbudget]
@ -95,7 +95,7 @@ module "viktor" {
 module "anca" {
   source = "./factory"
   name = "anca"
-  tag = "26.3.0"
+  tag = "26.4.0"
   tls_secret_name = var.tls_secret_name
   nfs_server = var.nfs_server
   depends_on = [kubernetes_namespace.actualbudget]
@ -118,7 +118,7 @@ module "anca" {
 module "emo" {
   source = "./factory"
   name = "emo"
-  tag = "26.3.0"
+  tag = "26.4.0"
   tls_secret_name = var.tls_secret_name
   nfs_server = var.nfs_server
   depends_on = [kubernetes_namespace.actualbudget]

@ -3,6 +3,12 @@ variable "tls_secret_name" {
   sensitive = true
 }

+# Temporary default until GHA pipeline publishes the first 8-char SHA tag.
+variable "beadboard_image_tag" {
+  type = string
+  default = "latest"
+}
+
 resource "kubernetes_namespace" "beads" {
   metadata {
     name = "beads-server"
@ -145,7 +151,7 @@ resource "kubernetes_deployment" "dolt" {
   }
   lifecycle {
     ignore_changes = [
-      spec[0].template[0].spec[0].dns_config
+      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
     ]
   }
 }
@ -349,7 +355,7 @@ resource "kubernetes_deployment" "workbench" {
   }
   lifecycle {
     ignore_changes = [
-      spec[0].template[0].spec[0].dns_config
+      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
     ]
   }
 }
@ -386,13 +392,13 @@ module "tls_secret" {
 }

 module "ingress" {
-  source = "../../modules/kubernetes/ingress_factory"
-  dns_type = "proxied"
-  namespace = kubernetes_namespace.beads.metadata[0].name
-  name = "dolt-workbench"
-  tls_secret_name = var.tls_secret_name
-  protected = false
-  exclude_crowdsec = true
+  source = "../../modules/kubernetes/ingress_factory"
+  dns_type = "proxied"
+  namespace = kubernetes_namespace.beads.metadata[0].name
+  name = "dolt-workbench"
+  tls_secret_name = var.tls_secret_name
+  protected = false
+  exclude_crowdsec = true
   extra_annotations = {
     "gethomepage.dev/enabled" = "true"
     "gethomepage.dev/name" = "Dolt Workbench"
@ -463,6 +469,38 @@ resource "kubernetes_config_map" "beadboard_config" {
   }
 }

+# Pulls the claude-agent-service bearer token from Vault so BeadBoard can
+# dispatch agent jobs via the in-cluster HTTP API.
+resource "kubernetes_manifest" "beadboard_agent_service_secret" {
+  manifest = {
+    apiVersion = "external-secrets.io/v1beta1"
+    kind = "ExternalSecret"
+    metadata = {
+      name = "beadboard-agent-service"
+      namespace = kubernetes_namespace.beads.metadata[0].name
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "beadboard-agent-service"
+      }
+      data = [
+        {
+          secretKey = "api_bearer_token"
+          remoteRef = {
+            key = "claude-agent-service"
+            property = "api_bearer_token"
+          }
+        },
+      ]
+    }
+  }
+}
+
 resource "kubernetes_deployment" "beadboard" {
   metadata {
     name = "beadboard"
@ -471,6 +509,9 @@ resource "kubernetes_deployment" "beadboard" {
       app = "beadboard"
       tier = local.tiers.aux
     }
+    annotations = {
+      "reloader.stakater.com/auto" = "true"
+    }
   }
   spec {
     replicas = 1
@ -507,13 +548,28 @@ resource "kubernetes_deployment" "beadboard" {

         container {
           name = "beadboard"
-          image = "registry.viktorbarzin.me:5050/beadboard:latest"
+          image = "registry.viktorbarzin.me:5050/beadboard:${var.beadboard_image_tag}"

           port {
             name = "http"
             container_port = 3000
           }

+          env {
+            name = "CLAUDE_AGENT_SERVICE_URL"
+            value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
+          }
+
+          env {
+            name = "CLAUDE_AGENT_BEARER_TOKEN"
+            value_from {
+              secret_key_ref {
+                name = "beadboard-agent-service"
+                key = "api_bearer_token"
+              }
+            }
+          }
+
           volume_mount {
             name = "beads-writable"
             mount_path = "/app/.beads"
@ -570,7 +626,7 @@ resource "kubernetes_deployment" "beadboard" {
   }
   lifecycle {
     ignore_changes = [
-      spec[0].template[0].spec[0].dns_config
+      spec[0].template[0].spec[0].dns_config # KYVERNO_LIFECYCLE_V1
     ]
   }
 }
@ -596,13 +652,13 @@ resource "kubernetes_service" "beadboard" {
 }

 module "beadboard_ingress" {
-  source = "../../modules/kubernetes/ingress_factory"
-  dns_type = "proxied"
-  namespace = kubernetes_namespace.beads.metadata[0].name
-  name = "beadboard"
-  tls_secret_name = var.tls_secret_name
-  protected = true
-  exclude_crowdsec = true
+  source = "../../modules/kubernetes/ingress_factory"
+  dns_type = "proxied"
+  namespace = kubernetes_namespace.beads.metadata[0].name
+  name = "beadboard"
+  tls_secret_name = var.tls_secret_name
+  protected = true
+  exclude_crowdsec = true
   extra_annotations = {
     "gethomepage.dev/enabled" = "true"
     "gethomepage.dev/name" = "BeadBoard"

@ -597,3 +597,162 @@ resource "kubernetes_cron_job_v1" "backup" {
     }
   }
 }
+
+# -----------------------------------------------------------------------------
+# Fidelity UK PlanViewer — monthly pension contribution sync
+#
+# Architecture notes:
+# - The CLI (`broker-sync fidelity-ingest`) loads storage_state.json, boots
+#   headless Chromium, scrapes the transaction history + valuation JSON, and
+#   posts DEPOSIT activities to Wealthfolio. See
+#   broker-sync/docs/providers/fidelity-planviewer.md for the seed workflow.
+# - Storage_state is staged to Vault (`secret/broker-sync` →
+#   `fidelity_storage_state`). ESO projects all broker-sync keys into the
+#   shared `broker-sync-secrets` K8s Secret; an init container writes the
+#   JSON blob to the PVC so the main container can load it.
+# - Image needs Chromium baked in — add the `fidelity-capable: "true"` label
+#   so the Dockerfile/CI treats this CronJob's pod spec as the Playwright
+#   variant. Until the Playwright image ships, keep `suspend = true`.
+# - Schedule: 05:00 UK on the 20th of each month — well after Viktor's mid-
+#   month payroll contribution has settled (finance history shows credits
+#   landing 13th-18th).
+resource "kubernetes_cron_job_v1" "fidelity" {
+  metadata {
+    name = "broker-sync-fidelity"
+    namespace = kubernetes_namespace.broker_sync.metadata[0].name
+    labels = { app = "broker-sync", component = "fidelity" }
+  }
+  spec {
+    schedule = "0 5 20 * *"
+    concurrency_policy = "Forbid"
+    successful_jobs_history_limit = 3
+    failed_jobs_history_limit = 5
+    # Suspended until the broker-sync image ships with Playwright + Chromium.
+    suspend = true
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit = 1
+        ttl_seconds_after_finished = 86400
+        template {
+          metadata {
+            labels = { app = "broker-sync", component = "fidelity" }
+          }
+          spec {
+            restart_policy = "OnFailure"
+            # Materialise the JSON storage_state from the projected Secret
+            # onto the PVC where Playwright expects to read it.
+            init_container {
+              name = "stage-storage-state"
+              image = "busybox:1.36"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -eu
+                mkdir -p /data
+                cp /secrets/fidelity_storage_state /data/fidelity_storage_state.json
+                chmod 600 /data/fidelity_storage_state.json
+              EOT
+              ]
+              volume_mount {
+                name = "secrets"
+                mount_path = "/secrets"
+                read_only = true
+              }
+              volume_mount {
+                name = "data"
+                mount_path = "/data"
+              }
+              resources {
+                requests = { cpu = "5m", memory = "8Mi" }
+                limits = { memory = "32Mi" }
+              }
+            }
+            container {
+              name = "broker-sync"
+              image = local.broker_sync_image
+              command = ["broker-sync", "fidelity-ingest"]
+
+              env {
+                name = "BROKER_SYNC_DATA_DIR"
+                value = "/data"
+              }
+              env {
+                name = "WF_SESSION_PATH"
+                value = "/data/wealthfolio_session.json"
+              }
+              env {
+                name = "FIDELITY_STORAGE_STATE_PATH"
+                value = "/data/fidelity_storage_state.json"
+              }
+              env {
+                name = "FIDELITY_PLAN_ID"
+                value_from {
+                  secret_key_ref {
+                    name = "broker-sync-secrets"
+                    key = "fidelity_plan_id"
+                  }
+                }
+              }
+              env {
+                name = "WF_BASE_URL"
+                value_from {
+                  secret_key_ref {
+                    name = "broker-sync-secrets"
+                    key = "wf_base_url"
+                  }
+                }
+              }
+              env {
+                name = "WF_USERNAME"
+                value_from {
+                  secret_key_ref {
+                    name = "broker-sync-secrets"
+                    key = "wf_username"
+                  }
+                }
+              }
+              env {
+                name = "WF_PASSWORD"
+                value_from {
+                  secret_key_ref {
+                    name = "broker-sync-secrets"
+                    key = "wf_password"
+                  }
+                }
+              }
+              volume_mount {
+                name = "data"
+                mount_path = "/data"
+              }
+              resources {
+                # Chromium is hungry — headless shell + page rendering
+                # comfortably under 1Gi, spike up to 1.2Gi during full-page
+                # screenshots.
+                requests = { cpu = "50m", memory = "512Mi" }
+                limits = { memory = "1280Mi" }
+              }
+            }
+            volume {
+              name = "secrets"
+              secret {
+                secret_name = "broker-sync-secrets"
+                items {
+                  key = "fidelity_storage_state"
+                  path = "fidelity_storage_state"
+                }
+              }
+            }
+            volume {
+              name = "data"
+              persistent_volume_claim {
+                claim_name = kubernetes_persistent_volume_claim.data_encrypted.metadata[0].name
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+  lifecycle {
+    ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
+}

589
stacks/claude-agent-service/main.tf
Normal file
@ -0,0 +1,589 @@
+data "vault_kv_secret_v2" "secrets" {
+  mount = "secret"
+  name = "claude-agent-service"
+}
+
+data "vault_kv_secret_v2" "viktor_secrets" {
+  mount = "secret"
+  name = "viktor"
+}
+
+locals {
+  namespace = "claude-agent"
+  image = "registry.viktorbarzin.me/claude-agent-service"
+  image_tag = "382d6b14"
+  labels = {
+    app = "claude-agent-service"
+  }
+}
+
+# --- Namespace ---
+
+resource "kubernetes_namespace" "claude_agent" {
+  metadata {
+    name = local.namespace
+    labels = {
+      tier = local.tiers.aux
+      "resource-governance/custom-limitrange" = "true"
+      "resource-governance/custom-quota" = "true"
+    }
+  }
+}
+
+# --- Secrets ---
+
+resource "kubernetes_manifest" "external_secret" {
+  manifest = {
+    apiVersion = "external-secrets.io/v1beta1"
+    kind = "ExternalSecret"
+    metadata = {
+      name = "claude-agent-secrets"
+      namespace = local.namespace
+    }
+    spec = {
+      refreshInterval = "15m"
+      secretStoreRef = {
+        name = "vault-kv"
+        kind = "ClusterSecretStore"
+      }
+      target = {
+        name = "claude-agent-secrets"
+      }
+      data = [
+        {
+          secretKey = "GITHUB_TOKEN"
+          remoteRef = {
+            key = "viktor"
+            property = "github_pat"
+          }
+        },
+        {
+          secretKey = "API_BEARER_TOKEN"
+          remoteRef = {
+            key = "claude-agent-service"
+            property = "api_bearer_token"
+          }
+        },
+        {
+          # Long-lived OAuth token (1-year) from `claude setup-token`.
+          # Preferred over the short-lived .credentials.json — CLI picks this up and
+          # skips the refresh flow entirely. Rotate yearly; alert 30d before expiry.
+          secretKey = "CLAUDE_CODE_OAUTH_TOKEN"
+          remoteRef = {
+            key = "claude-agent-service"
+            property = "claude_oauth_token"
+          }
+        },
+      ]
+    }
+  }
+  depends_on = [kubernetes_namespace.claude_agent]
+}
+
+# SOPS age key for terraform state decryption
+resource "kubernetes_secret" "sops_age_key" {
+  metadata {
+    name = "sops-age-key"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+  data = {
+    "keys.txt" = data.vault_kv_secret_v2.viktor_secrets.data["sops_age_key_devvm"]
+  }
+  type = "Opaque"
+}
+
+# Claude OAuth credentials (for claude -p)
+resource "kubernetes_secret" "claude_credentials" {
+  metadata {
+    name = "claude-credentials"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+  data = {
+    ".credentials.json" = data.vault_kv_secret_v2.secrets.data["claude_credentials_json"]
+  }
+  type = "Opaque"
+}
+
+# git-crypt key for repo decryption
+resource "kubernetes_config_map" "git_crypt_key" {
+  metadata {
+    name = "git-crypt-key"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+  binary_data = {
+    "key" = filebase64("${path.root}/../../.git/git-crypt/keys/default")
+  }
+}
+
+# --- RBAC ---
+
+resource "kubernetes_service_account" "claude_agent" {
+  metadata {
+    name = "claude-agent"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+}
+
+resource "kubernetes_cluster_role" "claude_agent" {
+  metadata {
+    name = "claude-agent"
+  }
+
+  rule {
+    verbs = ["get", "list", "watch"]
+    api_groups = ["", "apps", "batch"]
+    resources = ["pods", "pods/log", "nodes", "events", "deployments", "services", "namespaces", "jobs", "cronjobs", "configmaps", "replicasets", "statefulsets", "daemonsets"]
+  }
+
+  rule {
+    verbs = ["patch", "update"]
+    api_groups = ["apps"]
+    resources = ["deployments"]
+  }
+
+  rule {
+    verbs = ["create"]
+    api_groups = [""]
+    resources = ["pods/exec"]
+  }
+}
+
+resource "kubernetes_cluster_role_binding" "claude_agent" {
+  metadata {
+    name = "claude-agent"
+  }
+
+  subject {
+    kind = "ServiceAccount"
+    name = kubernetes_service_account.claude_agent.metadata[0].name
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+
+  role_ref {
+    api_group = "rbac.authorization.k8s.io"
+    kind = "ClusterRole"
+    name = kubernetes_cluster_role.claude_agent.metadata[0].name
+  }
+}
+
+# --- Storage ---
+
+resource "kubernetes_persistent_volume_claim" "workspace" {
+  wait_until_bound = false
+  metadata {
+    name = "claude-agent-workspace-encrypted"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+    annotations = {
+      "resize.topolvm.io/threshold" = "80%"
+      "resize.topolvm.io/increase" = "100%"
+      "resize.topolvm.io/storage_limit" = "20Gi"
+    }
+  }
+  spec {
+    access_modes = ["ReadWriteOnce"]
+    storage_class_name = "proxmox-lvm-encrypted"
+    resources {
+      requests = {
+        storage = "10Gi"
+      }
+    }
+  }
+}
+
+# --- Deployment ---
+
+resource "kubernetes_deployment" "claude_agent" {
+  metadata {
+    name = "claude-agent-service"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+    labels = local.labels
+  }
+
+  spec {
+    replicas = 1
+    strategy {
+      type = "Recreate"
+    }
+
+    selector {
+      match_labels = local.labels
+    }
+
+    template {
+      metadata {
+        labels = local.labels
+      }
+
+      spec {
+        service_account_name = kubernetes_service_account.claude_agent.metadata[0].name
+
+        image_pull_secrets {
+          name = "registry-credentials"
+        }
+
+        security_context {
+          run_as_user = 1000
+          run_as_group = 1000
+          fs_group = 1000
+        }
+
+        # Fix workspace ownership (PVC may have root-owned files from prior run)
+        init_container {
+          name = "fix-perms"
+          image = "busybox:1.37"
+          command = ["sh", "-c", "chown -R 1000:1000 /workspace"]
+          security_context {
+            run_as_user = 0
+          }
+          volume_mount {
+            name = "workspace"
+            mount_path = "/workspace"
+          }
+          resources {
+            requests = {
+              memory = "32Mi"
+            }
+            limits = {
+              memory = "64Mi"
+            }
+          }
+        }
+
+        # Copy Claude credentials to writable volume (CLI needs to refresh OAuth tokens)
+        init_container {
+          name = "copy-claude-creds"
+          image = "busybox:1.37"
+          command = ["sh", "-c", "cp /secrets/claude/.credentials.json /home/agent/.claude/.credentials.json && chown 1000:1000 /home/agent/.claude/.credentials.json"]
+          security_context {
+            run_as_user = 0
+          }
+          volume_mount {
+            name = "claude-credentials-secret"
+            mount_path = "/secrets/claude"
+          }
+          volume_mount {
+            name = "claude-home"
+            mount_path = "/home/agent/.claude"
+          }
+          resources {
+            requests = {
+              memory = "32Mi"
+            }
+            limits = {
+              memory = "64Mi"
+            }
+          }
+        }
+
+        # Init: clone repo + unlock git-crypt on first run
+        init_container {
+          name = "git-init"
+          image = "${local.image}:${local.image_tag}"
+          command = ["sh", "-c", <<-EOF
+            set -e
+
+            # Configure git with HTTPS + PAT
+            git config --global user.name "Claude Agent Service"
+            git config --global user.email "claude-agent@viktorbarzin.me"
+            git config --global --add safe.directory /workspace/infra
+            git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "git@github.com:"
+            git config --global url."https://$${GITHUB_TOKEN}@github.com/".insteadOf "https://github.com/"
+
+            # Clone or update repo
+            if [ ! -d /workspace/infra/.git ]; then
+              git clone https://$${GITHUB_TOKEN}@github.com/ViktorBarzin/infra.git /workspace/infra
+            else
+              cd /workspace/infra
+              git fetch origin
+              git reset --hard origin/master
+            fi
+
+            # Unlock git-crypt
+            cd /workspace/infra
+            git-crypt unlock /secrets/git-crypt/key || true
+          EOF
+          ]
+
+          env_from {
+            secret_ref {
+              name = "claude-agent-secrets"
+            }
+          }
+
+          volume_mount {
+            name = "workspace"
+            mount_path = "/workspace"
+          }
+          volume_mount {
+            name = "git-crypt-key"
+            mount_path = "/secrets/git-crypt"
+          }
+
+          resources {
+            requests = {
+              cpu = "100m"
+              memory = "256Mi"
+            }
+            limits = {
+              memory = "512Mi"
+            }
+          }
+        }
+
+        # Seed beads metadata + beads-task-runner agent into runtime volumes.
+        # The Dockerfile stages these files at /usr/share/agent-seed/ (image
+        # layer, never mounted). Both /workspace (PVC) and /home/agent/.claude
+        # (emptyDir) are volume mounts that hide any image-layer content, so
+        # the files have to be copied in at pod start. Also creates the
+        # scratch directory the beads-task-runner rails expect.
+        init_container {
+          name = "seed-beads-agent"
+          image = "${local.image}:${local.image_tag}"
+          command = ["sh", "-c", <<-EOT
+            set -e
+            mkdir -p /workspace/.beads /workspace/scratch /home/agent/.claude/agents
+            cp /usr/share/agent-seed/beads-metadata.json /workspace/.beads/metadata.json
+            cp /usr/share/agent-seed/beads-task-runner.md /home/agent/.claude/agents/beads-task-runner.md
+          EOT
+          ]
+
+          volume_mount {
+            name = "workspace"
+            mount_path = "/workspace"
+          }
+          volume_mount {
+            name = "claude-home"
+            mount_path = "/home/agent/.claude"
+          }
+
+          resources {
+            requests = {
+              memory = "32Mi"
+            }
+            limits = {
+              memory = "64Mi"
+            }
+          }
+        }
+
+        container {
+          name = "claude-agent-service"
+          image = "${local.image}:${local.image_tag}"
+
+          port {
+            container_port = 8080
+          }
+
+          env_from {
+            secret_ref {
+              name = "claude-agent-secrets"
+            }
+          }
+
+          env {
+            name = "WORKSPACE_DIR"
+            value = "/workspace/infra"
+          }
+
+          liveness_probe {
+            http_get {
+              path = "/health"
+              port = 8080
+            }
+            initial_delay_seconds = 10
+            period_seconds = 30
+          }
+
+          readiness_probe {
+            http_get {
+              path = "/health"
+              port = 8080
+            }
+            initial_delay_seconds = 5
+            period_seconds = 10
+          }
+
+          volume_mount {
+            name = "workspace"
+            mount_path = "/workspace"
+          }
+          volume_mount {
+            name = "sops-age-key"
+            mount_path = "/home/agent/.config/sops/age"
+          }
+          volume_mount {
+            name = "claude-home"
+            mount_path = "/home/agent/.claude"
+          }
+
+          resources {
+            requests = {
+              cpu = "500m"
+              memory = "2Gi"
+            }
+            limits = {
+              memory = "4Gi"
+            }
+          }
+        }
+
+        volume {
+          name = "workspace"
+          persistent_volume_claim {
+            claim_name = kubernetes_persistent_volume_claim.workspace.metadata[0].name
+          }
+        }
+
+        volume {
+          name = "sops-age-key"
+          secret {
+            secret_name = kubernetes_secret.sops_age_key.metadata[0].name
+            default_mode = "0600"
+          }
+        }
+
+        volume {
+          name = "git-crypt-key"
+          config_map {
+            name = kubernetes_config_map.git_crypt_key.metadata[0].name
+          }
+        }
+
+        volume {
+          name = "claude-credentials-secret"
+          secret {
+            secret_name = kubernetes_secret.claude_credentials.metadata[0].name
+            default_mode = "0600"
+          }
+        }
+
+        volume {
+          name = "claude-home"
+          empty_dir {}
+        }
+      }
+    }
+  }
+
+  lifecycle {
+    ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
+  }
+}
+
+# --- Service ---
+
+resource "kubernetes_service" "claude_agent" {
+  metadata {
+    name = "claude-agent-service"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+    labels = local.labels
+  }
+
+  spec {
+    selector = local.labels
+
+    port {
+      port = 8080
+      target_port = 8080
+    }
+
+    type = "ClusterIP"
+  }
+}
+
+# =============================================================================
+# Token expiry monitor
+# Long-lived CLAUDE_CODE_OAUTH_TOKEN values expire 1y after mint. We track
+# mint timestamps here — on rotation, update the map below. A CronJob pushes
+# the computed expiry_timestamp to Pushgateway, Prometheus alerts 30d out.
+# =============================================================================
+locals {
+  claude_oauth_token_mint_epochs = {
+    # unix seconds (UTC) — when `claude setup-token` finished minting
+    "primary" = 1776528429 # 2026-04-18T12:07:09Z (TOKEN2)
+    "spare-1" = 1776528280 # 2026-04-18T12:04:40Z (TOKEN1)
+    "spare-2" = 1776528429 # 2026-04-18T12:07:09Z (TOKEN2 — redundant w/ primary)
+  }
+  claude_oauth_token_ttl_seconds = 365 * 24 * 60 * 60
+}
+
+resource "kubernetes_config_map" "claude_oauth_expiry" {
+  metadata {
+    name = "claude-oauth-expiry"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+  data = {
+    for path, mint in local.claude_oauth_token_mint_epochs :
+    path => tostring(mint + local.claude_oauth_token_ttl_seconds)
+  }
+}
+
+resource "kubernetes_cron_job_v1" "claude_oauth_expiry_monitor" {
+  metadata {
+    name = "claude-oauth-expiry-monitor"
+    namespace = kubernetes_namespace.claude_agent.metadata[0].name
+  }
+  spec {
+    concurrency_policy = "Replace"
+    failed_jobs_history_limit = 3
+    successful_jobs_history_limit = 1
+    schedule = "17 */6 * * *" # every 6h at :17 past
+    job_template {
+      metadata {}
+      spec {
+        backoff_limit = 1
+        ttl_seconds_after_finished = 300
+        template {
+          metadata {}
+          spec {
+            restart_policy = "OnFailure"
+            container {
+              name = "push-expiry"
+              image = "docker.io/curlimages/curl:8.11.0"
+              command = ["/bin/sh", "-c", <<-EOT
+                set -e
+                PG='http://prometheus-prometheus-pushgateway.monitoring:9091/metrics/job/claude-oauth-expiry-monitor'
+                NOW=$(date +%s)
+                PAYLOAD=''
+                PAYLOAD="$${PAYLOAD}# HELP claude_oauth_token_expiry_timestamp Unix epoch when the CLAUDE_CODE_OAUTH_TOKEN for this path expires
+                "
+                PAYLOAD="$${PAYLOAD}# TYPE claude_oauth_token_expiry_timestamp gauge
+                "
+                for path in /mnt/expiry/*; do
+                  name=$(basename "$path")
+                  exp=$(cat "$path")
+                  PAYLOAD="$${PAYLOAD}claude_oauth_token_expiry_timestamp{path=\"$name\"} $exp
+                "
+                done
+                PAYLOAD="$${PAYLOAD}# HELP claude_oauth_expiry_monitor_last_push_timestamp Last time the expiry monitor pushed metrics
+                "
+                PAYLOAD="$${PAYLOAD}# TYPE claude_oauth_expiry_monitor_last_push_timestamp gauge
+                "
+                PAYLOAD="$${PAYLOAD}claude_oauth_expiry_monitor_last_push_timestamp $NOW
+                "
+                echo "$PAYLOAD"
+                echo "$PAYLOAD" | curl -sS --data-binary @- "$PG"
+                echo "pushed at $NOW"
+              EOT
+              ]
+              volume_mount {
+                name = "expiry"
+                mount_path = "/mnt/expiry"
+              }
+              resources {
+                requests = { cpu = "10m", memory = "32Mi" }
+                limits = { memory = "64Mi" }
+              }
+            }
+            volume {
+              name = "expiry"
+              config_map {
+                name = kubernetes_config_map.claude_oauth_expiry.metadata[0].name
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}

@ -236,6 +236,7 @@ resource "kubernetes_deployment" "claude-memory" {
     }
   }
   lifecycle {
+    # DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
     ignore_changes = [
       spec[0].template[0].spec[0].container[0].image
     ]

@ -382,13 +382,6 @@ sections:
         url: https://audiobookshelf.viktorbarzin.me/
         target: newtab
         id: 12_1475_audiobookshelf
-      - &ref_42
-        title: Ollama
-        description: Self-hosted ChatGPT (using llama3)
-        icon: si-openai
-        url: https://ollama.viktorbarzin.me/
-        target: newtab
-        id: 13_1475_ollama
       - &ref_43
         title: Paperless-ngx
         description: Document index
@ -411,7 +404,6 @@ sections:
       - *ref_39
       - *ref_40
       - *ref_41
-      - *ref_42
       - *ref_43
   - name: Under Construction
     displayData:

@ -14,14 +14,24 @@ data "vault_kv_secret_v2" "secrets" {
   name = "platform"
 }

-module "dbaas" {
-  source = "./modules/dbaas"
-  prod = var.prod
-  tls_secret_name = var.tls_secret_name
-  nfs_server = var.nfs_server
-  dbaas_root_password = data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]
-  postgresql_root_password = data.vault_kv_secret_v2.secrets.data["dbaas_postgresql_root_password"]
-  pgadmin_password = data.vault_kv_secret_v2.secrets.data["dbaas_pgadmin_password"]
-  kube_config_path = var.kube_config_path
-  tier = local.tiers.cluster
+# Personal/app-user secrets (forgejo + roundcubemail MySQL passwords live here,
+# not under secret/platform, to match the "secret/viktor as the go-to personal
+# vault" convention documented in .claude/CLAUDE.md).
+data "vault_kv_secret_v2" "viktor" {
+  mount = "secret"
+  name = "viktor"
+}
+
+module "dbaas" {
+  source = "./modules/dbaas"
+  prod = var.prod
+  tls_secret_name = var.tls_secret_name
+  nfs_server = var.nfs_server
+  dbaas_root_password = data.vault_kv_secret_v2.secrets.data["dbaas_root_password"]
+  postgresql_root_password = data.vault_kv_secret_v2.secrets.data["dbaas_postgresql_root_password"]
+  pgadmin_password = data.vault_kv_secret_v2.secrets.data["dbaas_pgadmin_password"]
+  mysql_forgejo_password = data.vault_kv_secret_v2.viktor.data["mysql_forgejo_password"]
+  mysql_roundcubemail_password = data.vault_kv_secret_v2.viktor.data["mysql_roundcubemail_password"]
+  kube_config_path = var.kube_config_path
+  tier = local.tiers.cluster
 }

@ -17,6 +17,18 @@ variable "kube_config_path" {
|
|||
sensitive = true
|
||||
}
|
||||
|
||||
# MySQL static application users (not rotated by Vault DB engine; baked into
|
||||
# each app's config). Codified here so future MySQL rebuilds cannot silently
|
||||
# drop them.
|
||||
variable "mysql_forgejo_password" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
variable "mysql_roundcubemail_password" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
|
||||
resource "kubernetes_namespace" "dbaas" {
|
||||
metadata {
|
||||
name = "dbaas"
|
||||
|
|
@ -537,7 +549,7 @@ resource "kubernetes_stateful_set_v1" "mysql_standalone" {
|
|||
}
|
||||
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -562,6 +574,55 @@ resource "kubernetes_service" "mysql" {
|
|||
depends_on = [kubernetes_stateful_set_v1.mysql_standalone]
|
||||
}
|
||||
|
||||
# MySQL static application users — not rotated by Vault DB engine.
|
||||
# Each app stores its password in its own config (forgejo app.ini, roundcube
|
||||
# ROUNDCUBEMAIL_DB_PASSWORD env). During the 2026-04-16 InnoDB Cluster →
|
||||
# standalone migration these users were accidentally dropped and recreated with
|
||||
# mismatched passwords; this block codifies them so a future rebuild cannot
|
||||
# silently break the apps.
|
||||
#
|
||||
# Pattern matches `null_resource.pg_terraform_state_db` below (local-exec into
|
||||
# the DB pod). We CREATE IF NOT EXISTS + ALTER USER on every apply so a
|
||||
# password rotation in Vault is re-synced on the next `scripts/tg apply`. The
|
||||
# `password_hash` trigger re-runs the provisioner when the Vault password
|
||||
# changes; the namespace/user triggers re-run if identifiers change.
|
||||
locals {
|
||||
mysql_static_users = {
|
||||
forgejo = {
|
||||
database = "forgejo"
|
||||
password = var.mysql_forgejo_password
|
||||
}
|
||||
roundcubemail = {
|
||||
database = "roundcubemail"
|
||||
password = var.mysql_roundcubemail_password
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "null_resource" "mysql_static_user" {
|
||||
for_each = local.mysql_static_users
|
||||
|
||||
depends_on = [kubernetes_stateful_set_v1.mysql_standalone]
|
||||
|
||||
triggers = {
|
||||
username = each.key
|
||||
database = each.value.database
|
||||
password_hash = sha256(each.value.password)
|
||||
}
|
||||
|
||||
provisioner "local-exec" {
|
||||
command = <<EOT
|
||||
kubectl --kubeconfig ${var.kube_config_path} exec -i -n dbaas mysql-standalone-0 -c mysql -- sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<'SQL'
|
||||
CREATE DATABASE IF NOT EXISTS `${each.value.database}`;
|
||||
CREATE USER IF NOT EXISTS '${each.key}'@'%' IDENTIFIED WITH caching_sha2_password BY '${each.value.password}';
|
||||
ALTER USER '${each.key}'@'%' IDENTIFIED WITH caching_sha2_password BY '${each.value.password}';
|
||||
GRANT ALL PRIVILEGES ON `${each.value.database}`.* TO '${each.key}'@'%';
|
||||
FLUSH PRIVILEGES;
|
||||
SQL
|
||||
EOT
|
||||
}
|
||||
}
|
||||
|
||||
module "nfs_mysql_backup_host" {
|
||||
source = "../../../../modules/kubernetes/nfs_volume"
|
||||
name = "dbaas-mysql-backup-host"
|
||||
|
|
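The trigger and the idempotent SQL in `null_resource.mysql_static_user` above can be illustrated outside Terraform. What follows is a minimal Python sketch, not part of this change set: `password_hash_trigger`, `triggers`, and `render_user_sql` are hypothetical helper names, and the passwords are placeholders. It assumes Terraform's `sha256()` returns the lowercase hex digest of the UTF-8 string.

```python
import hashlib


def password_hash_trigger(password: str) -> str:
    """Hex SHA-256 of the password, mirroring the `password_hash` trigger."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def triggers(username: str, database: str, password: str) -> dict:
    """The trigger map: any change to it would re-run the provisioner."""
    return {
        "username": username,
        "database": database,
        "password_hash": password_hash_trigger(password),
    }


def render_user_sql(username: str, database: str, password: str) -> str:
    """Render the statements the provisioner pipes into the mysql client.

    CREATE ... IF NOT EXISTS covers a rebuilt server; ALTER USER re-syncs the
    password when the user already exists. The sketch does no escaping, so a
    password containing a single quote would break the statement.
    """
    return "\n".join([
        f"CREATE DATABASE IF NOT EXISTS `{database}`;",
        f"CREATE USER IF NOT EXISTS '{username}'@'%' "
        f"IDENTIFIED WITH caching_sha2_password BY '{password}';",
        f"ALTER USER '{username}'@'%' "
        f"IDENTIFIED WITH caching_sha2_password BY '{password}';",
        f"GRANT ALL PRIVILEGES ON `{database}`.* TO '{username}'@'%';",
        "FLUSH PRIVILEGES;",
    ])


if __name__ == "__main__":
    before = triggers("forgejo", "forgejo", "old-placeholder")
    after = triggers("forgejo", "forgejo", "new-placeholder")
    print("provisioner re-runs:", before != after)
    print(render_user_sql("forgejo", "forgejo", "new-placeholder"))
```

Only the hash of the password is stored in the trigger map, so the comparison detects a rotation without keeping the plaintext in the trigger itself.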
@ -1379,6 +1440,30 @@ resource "null_resource" "pg_terraform_state_db" {
|
|||
}
|
||||
}
|
||||
|
||||
# Create payslip_ingest database for the payslip-ingest webhook service.
|
||||
# Role password is managed by Vault Database Secrets Engine (static role `pg-payslip-ingest`, 7d rotation).
|
||||
resource "null_resource" "pg_payslip_ingest_db" {
|
||||
depends_on = [null_resource.pg_cluster]
|
||||
|
||||
triggers = {
|
||||
db_name = "payslip_ingest"
|
||||
username = "payslip_ingest"
|
||||
}
|
||||
|
||||
provisioner "local-exec" {
|
||||
command = <<-EOT
|
||||
kubectl --kubeconfig ${var.kube_config_path} exec -n dbaas pg-cluster-1 -c postgres -- \
|
||||
bash -c '
|
||||
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '"'"'payslip_ingest'"'"'" | grep -q 1 || \
|
||||
psql -U postgres -c "CREATE ROLE payslip_ingest WITH LOGIN PASSWORD '"'"'changeme-vault-will-rotate'"'"'"
|
||||
psql -U postgres -tc "SELECT 1 FROM pg_catalog.pg_database WHERE datname = '"'"'payslip_ingest'"'"'" | grep -q 1 || \
|
||||
psql -U postgres -c "CREATE DATABASE payslip_ingest OWNER payslip_ingest"
|
||||
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE payslip_ingest TO payslip_ingest"
|
||||
'
|
||||
EOT
|
||||
}
|
||||
}
|
||||
|
||||
# Old PostgreSQL deployment — kept commented for rollback reference
|
||||
# resource "kubernetes_deployment" "postgres" {
|
||||
# metadata {
|
||||
|
|
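The `'"'"'` sequences in the `pg_payslip_ingest_db` provisioner above are the usual way to embed a single quote inside a single-quoted `bash -c` argument: close the quote, add a double-quoted `'`, then reopen. A minimal Python sketch of the same quoting, using the standard-library `shlex` module, is shown below. The helper names `role_exists_check` and `wrap_for_bash` are hypothetical and exist only for illustration.

```python
import shlex


def role_exists_check(role: str) -> str:
    """Build the inner psql command that tests whether a role exists."""
    sql = f"SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = '{role}'"
    return f'psql -U postgres -tc "{sql}"'


def wrap_for_bash(command: str) -> str:
    """Wrap a command as a single `bash -c` argument.

    shlex.quote single-quotes the string and rewrites every embedded
    single quote as '"'"', the same idiom used in the provisioner.
    """
    return "bash -c " + shlex.quote(command)


if __name__ == "__main__":
    inner = role_exists_check("payslip_ingest")
    wrapped = wrap_for_bash(inner)
    print(wrapped)
    # Splitting the wrapped line the way a POSIX shell would
    # recovers the inner command unchanged.
    print(shlex.split(wrapped)[2] == inner)
```

Generating the quoting mechanically is one way to avoid hand-counting quote characters when a statement changes.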
|
|||
|
|
@ -226,6 +226,6 @@ resource "kubernetes_deployment" "diun" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -346,7 +346,7 @@ resource "kubernetes_deployment" "calibre-web-automated" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -466,7 +466,7 @@ resource "kubernetes_deployment" "annas-archive-stacks" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -615,7 +615,7 @@ resource "kubernetes_deployment" "audiobookshelf" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -876,7 +876,7 @@ resource "kubernetes_deployment" "book_search" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -158,11 +158,12 @@ module "ingress" {
|
|||
name = "forgejo"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Forgejo"
|
||||
"gethomepage.dev/description" = "Git hosting"
|
||||
"gethomepage.dev/icon" = "forgejo.png"
|
||||
"gethomepage.dev/group" = "Development & CI"
|
||||
"gethomepage.dev/pod-selector" = ""
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Forgejo"
|
||||
"gethomepage.dev/description" = "Git hosting"
|
||||
"gethomepage.dev/icon" = "forgejo.png"
|
||||
"gethomepage.dev/group" = "Development & CI"
|
||||
"gethomepage.dev/pod-selector" = ""
|
||||
"uptime.viktorbarzin.me/external-monitor-path" = "/api/healthz"
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -193,7 +193,7 @@ resource "kubernetes_deployment" "freedify" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -43,7 +43,6 @@ locals {
|
|||
mailserver_accounts = jsondecode(data.kubernetes_secret.eso_secrets.data["mailserver_accounts"])
|
||||
}
|
||||
variable "redis_host" { type = string }
|
||||
variable "ollama_host" { type = string }
|
||||
variable "mail_host" { type = string }
|
||||
|
||||
|
||||
|
|
@ -62,10 +61,10 @@ module "tls_secret" {
|
|||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
|
||||
resource "kubernetes_persistent_volume_claim" "data_proxmox" {
|
||||
resource "kubernetes_persistent_volume_claim" "data_encrypted" {
|
||||
wait_until_bound = false
|
||||
metadata {
|
||||
name = "grampsweb-data-proxmox"
|
||||
name = "grampsweb-data-encrypted"
|
||||
namespace = kubernetes_namespace.grampsweb.metadata[0].name
|
||||
annotations = {
|
||||
"resize.topolvm.io/threshold" = "80%"
|
||||
|
|
@ -75,7 +74,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
|
|||
}
|
||||
spec {
|
||||
access_modes = ["ReadWriteOnce"]
|
||||
storage_class_name = "proxmox-lvm"
|
||||
storage_class_name = "proxmox-lvm-encrypted"
|
||||
resources {
|
||||
requests = {
|
||||
storage = "1Gi"
|
||||
|
|
@ -147,14 +146,6 @@ locals {
|
|||
name = "GRAMPSWEB_DEFAULT_FROM_EMAIL"
|
||||
value = "info@viktorbarzin.me"
|
||||
},
|
||||
{
|
||||
name = "GRAMPSWEB_LLM_BASE_URL"
|
||||
value = "http://${var.ollama_host}:11434/v1"
|
||||
},
|
||||
{
|
||||
name = "GRAMPSWEB_LLM_MODEL"
|
||||
value = "llama3.1"
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
|
|
@ -325,7 +316,7 @@ resource "kubernetes_deployment" "grampsweb" {
|
|||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = kubernetes_persistent_volume_claim.data_proxmox.metadata[0].name
|
||||
claim_name = kubernetes_persistent_volume_claim.data_encrypted.metadata[0].name
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -374,7 +374,7 @@ resource "kubernetes_deployment" "hermes_agent" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -62,9 +62,6 @@ Widget-capable matches (candidate): **27**
|
|||
| `nextcloud` | `whiteboard` | `https://whiteboard.viktorbarzin.me` | `nextcloud` |
|
||||
| `ntfy` | `ntfy` | `https://ntfy.viktorbarzin.me` | `link-only` |
|
||||
| `nvidia` | `nvidia-exporter` | `https://nvidia-exporter.viktorbarzin.lan` | `link-only` |
|
||||
| `ollama` | `ollama` | `https://ollama.viktorbarzin.me` | `link-only` |
|
||||
| `ollama` | `ollama-api` | `https://ollama-api.viktorbarzin.me` | `link-only` |
|
||||
| `ollama` | `ollama-server` | `https://ollama-server.viktorbarzin.lan` | `link-only` |
|
||||
| `onlyoffice` | `onlyoffice` | `https://onlyoffice.viktorbarzin.me` | `link-only` |
|
||||
| `openclaw` | `openclaw` | `https://openclaw.viktorbarzin.me` | `link-only` |
|
||||
| `owntracks` | `owntracks` | `https://owntracks.viktorbarzin.me` | `link-only` |
|
||||
|
|
|
|||
|
|
@ -145,7 +145,7 @@ resource "kubernetes_deployment" "immich_server" {
|
|||
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config,
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
]
|
||||
}
|
||||
|
||||
|
|
@ -373,7 +373,7 @@ resource "kubernetes_deployment" "immich-postgres" {
|
|||
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config,
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
]
|
||||
}
|
||||
|
||||
|
|
@ -532,7 +532,7 @@ resource "kubernetes_deployment" "immich-machine-learning" {
|
|||
|
||||
lifecycle {
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config,
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
]
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -198,7 +198,7 @@ resource "kubernetes_deployment" "insta2spotify" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -1,14 +0,0 @@
|
|||
variable "nfs_server" { type = string }
|
||||
|
||||
data "vault_kv_secret_v2" "secrets" {
|
||||
mount = "secret"
|
||||
name = "platform"
|
||||
}
|
||||
|
||||
module "iscsi-csi" {
|
||||
source = "./modules/iscsi-csi"
|
||||
tier = local.tiers.cluster
|
||||
truenas_host = var.nfs_server
|
||||
truenas_api_key = data.vault_kv_secret_v2.secrets.data["truenas_api_key"]
|
||||
truenas_ssh_private_key = data.vault_kv_secret_v2.secrets.data["truenas_ssh_private_key"]
|
||||
}
|
||||
|
|
@ -1,148 +0,0 @@
|
|||
resource "kubernetes_namespace" "iscsi_csi" {
|
||||
metadata {
|
||||
name = "iscsi-csi"
|
||||
labels = {
|
||||
tier = var.tier
|
||||
"resource-governance/custom-quota" = "true"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "helm_release" "democratic_csi" {
|
||||
namespace = kubernetes_namespace.iscsi_csi.metadata[0].name
|
||||
create_namespace = false
|
||||
name = "democratic-csi-iscsi"
|
||||
atomic = true
|
||||
timeout = 300
|
||||
|
||||
repository = "https://democratic-csi.github.io/charts/"
|
||||
chart = "democratic-csi"
|
||||
|
||||
values = [yamlencode({
|
||||
csiDriver = {
|
||||
name = "org.democratic-csi.iscsi"
|
||||
}
|
||||
|
||||
storageClasses = [{
|
||||
name = "iscsi-truenas"
|
||||
defaultClass = false
|
||||
reclaimPolicy = "Retain"
|
||||
volumeBindingMode = "Immediate"
|
||||
allowVolumeExpansion = true
|
||||
parameters = {
|
||||
fsType = "ext4"
|
||||
}
|
||||
mountOptions = []
|
||||
}]
|
||||
|
||||
controller = {
|
||||
replicas = 2
|
||||
driver = {
|
||||
resources = {
|
||||
requests = { cpu = "25m", memory = "192Mi" }
|
||||
limits = { memory = "192Mi" }
|
||||
}
|
||||
}
|
||||
externalProvisioner = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "64Mi" }
|
||||
limits = { memory = "64Mi" }
|
||||
}
|
||||
}
|
||||
externalAttacher = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "64Mi" }
|
||||
limits = { memory = "64Mi" }
|
||||
}
|
||||
}
|
||||
externalResizer = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "64Mi" }
|
||||
limits = { memory = "64Mi" }
|
||||
}
|
||||
}
|
||||
externalSnapshotter = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "80Mi" }
|
||||
limits = { memory = "80Mi" }
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# csiProxy is a top-level chart key, NOT nested under controller/node
|
||||
csiProxy = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "32Mi" }
|
||||
limits = { memory = "32Mi" }
|
||||
}
|
||||
}
|
||||
|
||||
node = {
|
||||
driver = {
|
||||
resources = {
|
||||
requests = { cpu = "25m", memory = "192Mi" }
|
||||
limits = { memory = "192Mi" }
|
||||
}
|
||||
}
|
||||
driverRegistrar = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "32Mi" }
|
||||
limits = { memory = "32Mi" }
|
||||
}
|
||||
}
|
||||
cleanup = {
|
||||
resources = {
|
||||
requests = { cpu = "5m", memory = "32Mi" }
|
||||
limits = { memory = "32Mi" }
|
||||
}
|
||||
}
|
||||
|
||||
hostPID = true
|
||||
hostPath = "/lib/modules"
|
||||
}
|
||||
|
||||
driver = {
|
||||
config = {
|
||||
driver = "freenas-iscsi"
|
||||
|
||||
instance_id = "truenas-iscsi"
|
||||
|
||||
httpConnection = {
|
||||
protocol = "http"
|
||||
host = var.truenas_host
|
||||
port = 80
|
||||
apiKey = var.truenas_api_key
|
||||
}
|
||||
|
||||
sshConnection = {
|
||||
host = var.truenas_host
|
||||
port = 22
|
||||
username = "root"
|
||||
privateKey = var.truenas_ssh_private_key
|
||||
}
|
||||
|
||||
zfs = {
|
||||
datasetParentName = "main/iscsi"
|
||||
detachedSnapshotsDatasetParentName = "main/iscsi-snaps"
|
||||
}
|
||||
|
||||
iscsi = {
|
||||
targetPortal = "${var.truenas_host}:3260"
|
||||
namePrefix = "csi-"
|
||||
nameSuffix = ""
|
||||
targetGroups = [{
|
||||
targetGroupPortalGroup = 1
|
||||
targetGroupInitiatorGroup = 1
|
||||
targetGroupAuthType = "None"
|
||||
}]
|
||||
extentInsecureTpc = true
|
||||
extentXenCompat = false
|
||||
extentDisablePhysicalBlocksize = true
|
||||
extentBlocksize = 512
|
||||
extentRpm = "SSD"
|
||||
extentAvailThreshold = 0
|
||||
}
|
||||
}
|
||||
}
|
||||
})]
|
||||
}
|
||||
|
|
@ -1,10 +0,0 @@
|
|||
variable "tier" { type = string }
|
||||
variable "truenas_host" { type = string }
|
||||
variable "truenas_api_key" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
variable "truenas_ssh_private_key" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
|
|
@ -1 +0,0 @@
|
|||
../../secrets
|
||||
|
|
@ -1,8 +0,0 @@
|
|||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
dependency "infra" {
|
||||
config_path = "../infra"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
|
@ -113,8 +113,9 @@ resource "kubernetes_deployment" "k8s_portal" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA); Kyverno mutates dns_config for ndots. Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].dns_config,
|
||||
spec[0].template[0].spec[0].dns_config, # KYVERNO_LIFECYCLE_V1
|
||||
spec[0].template[0].spec[0].container[0].image, # CI updates image tag
|
||||
]
|
||||
}
|
||||
|
|
|
|||
|
|
@ -1025,3 +1025,131 @@ resource "kubernetes_manifest" "cleanup_failed_pods" {
|
|||
}
|
||||
}
|
||||
}
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
# Strip CPU Limits (Kyverno Mutate)
|
||||
# -----------------------------------------------------------------------------
|
||||
# Removes resources.limits.cpu from every container and initContainer at pod
|
||||
# admission. Memory limits are preserved. Cluster policy: CFS throttling causes
|
||||
# more harm than good for bursty single-threaded workloads (Node.js, Python
|
||||
# apps). Upstream Helm charts (CrowdSec, descheduler, kubernetes-dashboard,
|
||||
# nvidia gpu-operator) still ship CPU limits — this strips them declaratively
|
||||
# so we don't have to fork values.yaml per chart.
|
||||
#
|
||||
# Scope: admission-time only. Existing pods keep their limits until restarted
|
||||
# naturally (Helm upgrade, node drain, rollout). No mutateExistingOnPolicyUpdate.
|
||||
#
|
||||
# JSON6902 remove op fails on missing paths — per-element precondition gates
|
||||
# the mutation so pods without CPU limits pass through untouched.
|
||||
|
||||
resource "kubernetes_manifest" "mutate_strip_cpu_limits" {
|
||||
manifest = {
|
||||
apiVersion = "kyverno.io/v1"
|
||||
kind = "ClusterPolicy"
|
||||
metadata = {
|
||||
name = "strip-cpu-limits"
|
||||
annotations = {
|
||||
"policies.kyverno.io/title" = "Strip CPU Limits"
|
||||
"policies.kyverno.io/description" = join("", [
|
||||
"Removes resources.limits.cpu from every container and initContainer ",
|
||||
"at pod admission. Memory limits are preserved. Cluster policy: CFS ",
|
||||
"throttling causes more harm than good for bursty single-threaded ",
|
||||
"workloads (Node.js, Python apps).",
|
||||
])
|
||||
}
|
||||
}
|
||||
spec = {
|
||||
background = false
|
||||
rules = [
|
||||
{
|
||||
name = "strip-container-cpu-limit"
|
||||
match = {
|
||||
any = [
|
||||
{
|
||||
resources = {
|
||||
kinds = ["Pod"]
|
||||
operations = ["CREATE"]
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
preconditions = {
|
||||
all = [
|
||||
{
|
||||
key = "{{ request.object.spec.containers[?resources.limits.cpu != null] | length(@) }}"
|
||||
operator = "GreaterThan"
|
||||
value = 0
|
||||
}
|
||||
]
|
||||
}
|
||||
mutate = {
|
||||
foreach = [
|
||||
{
|
||||
list = "request.object.spec.containers"
|
||||
preconditions = {
|
||||
all = [
|
||||
{
|
||||
key = "{{ element.resources.limits.cpu || '' }}"
|
||||
operator = "NotEquals"
|
||||
value = ""
|
||||
}
|
||||
]
|
||||
}
|
||||
patchesJson6902 = yamlencode([
|
||||
{
|
||||
op = "remove"
|
||||
path = "/spec/containers/{{ elementIndex }}/resources/limits/cpu"
|
||||
}
|
||||
])
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
{
|
||||
name = "strip-initcontainer-cpu-limit"
|
||||
match = {
|
||||
any = [
|
||||
{
|
||||
resources = {
|
||||
kinds = ["Pod"]
|
||||
operations = ["CREATE"]
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
preconditions = {
|
||||
all = [
|
||||
{
|
||||
key = "{{ request.object.spec.initContainers[?resources.limits.cpu != null] || `[]` | length(@) }}"
|
||||
operator = "GreaterThan"
|
||||
value = 0
|
||||
}
|
||||
]
|
||||
}
|
||||
mutate = {
|
||||
foreach = [
|
||||
{
|
||||
list = "request.object.spec.initContainers"
|
||||
preconditions = {
|
||||
all = [
|
||||
{
|
||||
key = "{{ element.resources.limits.cpu || '' }}"
|
||||
operator = "NotEquals"
|
||||
value = ""
|
||||
}
|
||||
]
|
||||
}
|
||||
patchesJson6902 = yamlencode([
|
||||
{
|
||||
op = "remove"
|
||||
path = "/spec/initContainers/{{ elementIndex }}/resources/limits/cpu"
|
||||
}
|
||||
])
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
}
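The reason for the per-element precondition in `strip-cpu-limits` can be shown with a small simulation. The following is a minimal Python sketch, not Kyverno code: `json_patch_remove` is a hypothetical, simplified RFC 6902 `remove` (it ignores `~0`/`~1` pointer escapes), and `strip_cpu_limits` mimics the `foreach` over `containers` and `initContainers` under the assumption that a pod is a plain nested dict.

```python
import copy


def json_patch_remove(doc, path):
    """Apply a simplified RFC 6902 `remove`; raises if the target is missing."""
    tokens = path.split("/")[1:]
    parent = doc
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    last = tokens[-1]
    if isinstance(parent, list):
        del parent[int(last)]
    else:
        del parent[last]  # KeyError when the key does not exist


def strip_cpu_limits(pod):
    """Return a copy of the pod with resources.limits.cpu removed everywhere."""
    patched = copy.deepcopy(pod)
    for field in ("containers", "initContainers"):
        for index, element in enumerate(patched["spec"].get(field) or []):
            limits = (element.get("resources") or {}).get("limits") or {}
            # Per-element precondition: only patch when a CPU limit is set.
            if limits.get("cpu") in (None, ""):
                continue
            json_patch_remove(
                patched, f"/spec/{field}/{index}/resources/limits/cpu"
            )
    return patched


if __name__ == "__main__":
    pod = {
        "spec": {
            "containers": [
                {"name": "app", "resources": {"limits": {"cpu": "500m", "memory": "128Mi"}}},
                {"name": "sidecar", "resources": {}},
            ],
            "initContainers": [
                {"name": "init", "resources": {"limits": {"cpu": "1"}}},
            ],
        }
    }
    print(strip_cpu_limits(pod))
```

Without the `continue` guard, the second container (which has no `limits` key) would make the remove operation raise, which is the failure mode the policy comment describes.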
|
||||
|
|
|
|||
|
|
@ -127,9 +127,10 @@ resource "kubernetes_config_map" "mailserver_config" {
|
|||
logtarget = SYSOUT
|
||||
EOF
|
||||
}
|
||||
# Password hashes differ on every run; ignore them to avoid rewriting the secret on each apply.
|
||||
# Password hashes differ on every run; ignore them to avoid rewriting the secret on each apply.
|
||||
# Either (1) create consistent hashes or (2) find a way to ignore_changes per password.
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: postfix-accounts.cf password hashes non-deterministic; would flap on every apply. Reviewed 2026-04-18.
|
||||
ignore_changes = [data["postfix-accounts.cf"]]
|
||||
}
|
||||
}
|
||||
|
|
|
|||
224
stacks/monitoring/modules/monitoring/dashboards/uk-payslip.json
Normal file
|
|
@ -0,0 +1,224 @@
|
|||
{
|
||||
"annotations": {
|
||||
"list": [
|
||||
{
|
||||
"builtIn": 1,
|
||||
"datasource": { "type": "datasource", "uid": "grafana" },
|
||||
"enable": true,
|
||||
"hide": true,
|
||||
"iconColor": "rgba(0, 211, 255, 1)",
|
||||
"name": "Annotations & Alerts",
|
||||
"type": "dashboard"
|
||||
}
|
||||
]
|
||||
},
|
||||
"description": "UK payslip breakdown — gross/net/tax/NI trends, YTD progression against income tax bands, deductions split, and effective rate.",
|
||||
"editable": true,
|
||||
"fiscalYearStartMonth": 0,
|
||||
"graphTooltip": 1,
|
||||
"id": null,
|
||||
"links": [],
|
||||
"panels": [
|
||||
{
|
||||
"id": 1,
|
||||
"title": "Monthly gross / net / tax / NI",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"gridPos": { "h": 9, "w": 12, "x": 0, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"unit": "currencyGBP",
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 10,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": { "type": "linear" },
|
||||
"showPoints": "auto",
|
||||
"spanNulls": false,
|
||||
"stacking": { "group": "A", "mode": "none" },
|
||||
"thresholdsStyle": { "mode": "off" }
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["last", "mean"], "displayMode": "table", "placement": "bottom" },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"rawSql": "SELECT pay_date AS \"time\", gross_pay, net_pay, income_tax, national_insurance FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
|
||||
"format": "time_series",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"title": "YTD gross (this tax year) with UK band thresholds",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"gridPos": { "h": 9, "w": 12, "x": 12, "y": 0 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"unit": "currencyGBP",
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "YTD gross",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 15,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": { "type": "linear" },
|
||||
"showPoints": "auto",
|
||||
"spanNulls": false,
|
||||
"stacking": { "group": "A", "mode": "none" },
|
||||
"thresholdsStyle": { "mode": "line" }
|
||||
},
|
||||
"thresholds": {
|
||||
"mode": "absolute",
|
||||
"steps": [
|
||||
{ "color": "green", "value": null },
|
||||
{ "color": "yellow", "value": 12570 },
|
||||
{ "color": "orange", "value": 50270 },
|
||||
{ "color": "red", "value": 125140 }
|
||||
]
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["last", "max"], "displayMode": "table", "placement": "bottom" },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"rawSql": "SELECT pay_date AS \"time\", SUM(gross_pay) OVER (PARTITION BY tax_year ORDER BY pay_date) AS ytd_gross FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
|
||||
"format": "time_series",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"title": "Deductions breakdown per payslip",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"gridPos": { "h": 9, "w": 12, "x": 0, "y": 9 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"unit": "currencyGBP",
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "bars",
|
||||
"fillOpacity": 80,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
|
||||
"lineWidth": 1,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": { "type": "linear" },
|
||||
"showPoints": "never",
|
||||
"spanNulls": false,
|
||||
"stacking": { "group": "A", "mode": "normal" },
|
||||
"thresholdsStyle": { "mode": "off" }
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["sum", "mean"], "displayMode": "table", "placement": "bottom" },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"rawSql": "SELECT pay_date AS \"time\", income_tax, national_insurance, pension_employee, student_loan FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
|
||||
"format": "time_series",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"id": 4,
|
||||
"title": "Latest effective rate & take-home %",
|
||||
"type": "timeseries",
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"gridPos": { "h": 9, "w": 12, "x": 12, "y": 9 },
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"color": { "mode": "palette-classic" },
|
||||
"unit": "percent",
|
||||
"min": 0,
|
||||
"max": 100,
|
||||
"custom": {
|
||||
"axisBorderShow": false,
|
||||
"axisCenteredZero": false,
|
||||
"axisColorMode": "text",
|
||||
"axisLabel": "",
|
||||
"axisPlacement": "auto",
|
||||
"barAlignment": 0,
|
||||
"drawStyle": "line",
|
||||
"fillOpacity": 10,
|
||||
"gradientMode": "none",
|
||||
"hideFrom": { "legend": false, "tooltip": false, "viz": false },
|
||||
"lineWidth": 2,
|
||||
"pointSize": 5,
|
||||
"scaleDistribution": { "type": "linear" },
|
||||
"showPoints": "auto",
|
||||
"spanNulls": false,
|
||||
"stacking": { "group": "A", "mode": "none" },
|
||||
"thresholdsStyle": { "mode": "off" }
|
||||
}
|
||||
},
|
||||
"overrides": []
|
||||
},
|
||||
"options": {
|
||||
"legend": { "calcs": ["last", "mean"], "displayMode": "table", "placement": "bottom" },
|
||||
"tooltip": { "mode": "multi", "sort": "desc" }
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"datasource": { "type": "postgres", "uid": "payslips-pg" },
|
||||
"rawSql": "SELECT pay_date AS \"time\", ROUND(((income_tax + national_insurance)::numeric / NULLIF(gross_pay, 0)) * 100, 2) AS \"effective_rate_pct\", ROUND((net_pay::numeric / NULLIF(gross_pay, 0)) * 100, 2) AS \"take_home_pct\" FROM payslip_ingest.payslip WHERE $__timeFilter(pay_date) ORDER BY pay_date",
|
||||
"format": "time_series",
|
||||
"refId": "A"
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"refresh": "5m",
|
||||
"schemaVersion": 39,
|
||||
"tags": ["finance", "personal", "uk-tax"],
|
||||
"templating": { "list": [] },
|
||||
"time": { "from": "now-2y", "to": "now" },
|
||||
"timepicker": {},
|
||||
"timezone": "browser",
|
||||
"title": "UK Payslip",
|
||||
"uid": "uk-payslip",
|
||||
"version": 1
|
||||
}
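Two of the dashboard queries above carry logic worth spelling out: the running `SUM(gross_pay) OVER (PARTITION BY tax_year ORDER BY pay_date)` in panel 2 and the `NULLIF(gross_pay, 0)` guard in panel 4. The following is a minimal Python sketch of both, using made-up sample rows. The helper names `ytd_gross` and `pct` are hypothetical, and the sketch assumes one payslip per `pay_date`.

```python
from decimal import Decimal, ROUND_HALF_UP


def ytd_gross(rows):
    """Running gross per tax year, ordered by pay date.

    Mirrors SUM(gross_pay) OVER (PARTITION BY tax_year ORDER BY pay_date):
    the total restarts whenever the tax_year value changes.
    """
    totals = {}
    result = []
    for row in sorted(rows, key=lambda r: r["pay_date"]):
        year = row["tax_year"]
        totals[year] = totals.get(year, Decimal(0)) + row["gross_pay"]
        result.append((row["pay_date"], totals[year]))
    return result


def pct(numerator, denominator):
    """Percentage rounded to 2 places, or None when the denominator is zero.

    NULLIF(gross_pay, 0) turns a zero denominator into NULL, so the SQL
    division yields NULL instead of raising a division-by-zero error.
    """
    if denominator == 0:
        return None
    ratio = numerator / denominator * 100
    return ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    sample = [
        {"pay_date": "2025-04-30", "tax_year": "2025-26", "gross_pay": Decimal(5000)},
        {"pay_date": "2025-05-31", "tax_year": "2025-26", "gross_pay": Decimal(5000)},
        {"pay_date": "2026-04-30", "tax_year": "2026-27", "gross_pay": Decimal(6000)},
    ]
    for pay_date, total in ytd_gross(sample):
        print(pay_date, total)
    print(pct(Decimal(1500), Decimal(5000)))
```

The YTD series is what panel 2 draws against the threshold lines at 12570, 50270, and 125140, so it only makes sense when the running total resets at each tax-year boundary.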
|
||||
|
|
@ -73,12 +73,12 @@ alertmanager:
|
|||
- source_matchers:
|
||||
- alertname = NodeDown
|
||||
target_matchers:
|
||||
- alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|MySQLDown|RedisDown|HeadscaleDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
|
||||
- alertname =~ "NodeNotReady|NodeConditionBad|PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|NodeLowFreeMemory|PostgreSQLDown|RedisDown|HeadscaleDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|NodeExporterDown|DockerRegistryDown|HomeAssistantDown|CloudflaredDown|TechnitiumDNSDown|iDRACRedfishMetricsMissing|iDRACSNMPMetricsMissing|HomeAssistantMetricsMissing"
|
||||
# NFS down causes mass pod failures and NFS-dependent service outages
|
||||
- source_matchers:
|
||||
- alertname = NFSServerUnresponsive
|
||||
target_matchers:
|
||||
- alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|MySQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|HomeAssistantDown"
|
||||
- alertname =~ "PodCrashLooping|ContainerOOMKilled|DeploymentReplicasMismatch|StatefulSetReplicasMismatch|DaemonSetMissingPods|ScrapeTargetDown|PostgreSQLDown|RedisDown|AuthentikDown|PoisonFountainDown|HackmdDown|PrivatebinDown|MailServerDown|EmailRoundtripFailing|EmailRoundtripStale|HomeAssistantDown"
|
||||
# Traefik down makes service-level alerts noise
|
||||
- source_matchers:
|
||||
- alertname = TraefikDown
|
||||
|
|
@ -1340,13 +1340,6 @@ serverFiles:
|
|||
severity: critical
|
||||
annotations:
|
||||
summary: "PostgreSQL pod {{ $labels.pod }} is not ready"
|
||||
- alert: MySQLDown
|
||||
expr: kube_statefulset_status_replicas_ready{namespace="dbaas", statefulset="mysql-cluster"} < 1
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "MySQL InnoDB Cluster has no ready replicas"
|
||||
- alert: RedisDown
|
||||
expr: kube_statefulset_status_replicas_ready{namespace="redis", statefulset="redis-node"} < 1
|
||||
for: 5m
|
||||
|
|
@ -1391,13 +1384,6 @@ serverFiles:
|
|||
severity: warning
|
||||
annotations:
|
||||
summary: "CNPG operator down — PostgreSQL failover/management degraded"
|
||||
- alert: MySQLOperatorDown
|
||||
expr: (kube_deployment_status_replicas_available{namespace="mysql-operator", deployment="mysql-operator"} or on() vector(0)) < 1
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "MySQL operator down — InnoDB Cluster management degraded"
|
||||
- name: Cluster
|
||||
rules:
|
||||
- alert: NodeDown
|
||||
|
|
@ -1755,6 +1741,38 @@ serverFiles:
|
|||
severity: warning
|
||||
annotations:
|
||||
summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
|
||||
- alert: ClaudeOAuthTokenExpiringSoon
|
||||
expr: (claude_oauth_token_expiry_timestamp{job="claude-oauth-expiry-monitor"} - time()) < (30 * 86400)
|
||||
for: 1h
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Claude OAuth token {{ $labels.path }} expires in <30 days"
|
||||
description: "Run `claude setup-token` to mint a new 1-year token and update the corresponding Vault path + mint_epoch in stacks/claude-agent-service/main.tf."
|
||||
- alert: ClaudeOAuthTokenCritical
|
||||
expr: (claude_oauth_token_expiry_timestamp{job="claude-oauth-expiry-monitor"} - time()) < (7 * 86400)
|
||||
for: 10m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Claude OAuth token {{ $labels.path }} expires in <7 days — rotate NOW"
|
||||
description: "The long-lived CLAUDE_CODE_OAUTH_TOKEN is within 1 week of expiry. Automated upgrades will break when it expires. Harvest via `claude setup-token` and update Vault + TF."
|
||||
- alert: ClaudeOAuthTokenMonitorStale
|
||||
expr: (time() - claude_oauth_expiry_monitor_last_push_timestamp) > (48 * 3600)
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Claude OAuth expiry monitor hasn't pushed in >48h"
|
||||
description: "CronJob claude-oauth-expiry-monitor in claude-agent ns isn't running. Check `kubectl -n claude-agent get cronjob claude-oauth-expiry-monitor`."
|
||||
- alert: ClaudeOAuthTokenMonitorNeverRun
|
||||
expr: absent(claude_oauth_expiry_monitor_last_push_timestamp)
|
||||
for: 2h
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Claude OAuth expiry monitor has never pushed — CronJob not running"
|
||||
description: "Expected `claude_oauth_expiry_monitor_last_push_timestamp` to appear once the CronJob runs. Check the CronJob in claude-agent namespace."
|
||||
- alert: HackmdDown
|
||||
expr: (kube_deployment_status_replicas_available{namespace="hackmd"} or on() vector(0)) < 1
|
||||
for: 5m
|
||||
|
|
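Reviewer note: the two `ClaudeOAuthToken*` expiry rules in the hunk above compare the remaining token lifetime (`claude_oauth_token_expiry_timestamp - time()`) against a 30-day and a 7-day window. A minimal Python sketch of the same arithmetic, useful for sanity-checking the thresholds. The function name `classify_token` and its return labels are hypothetical and not part of the repository, and the `for:` hold durations are ignored.

```python
SECONDS_PER_DAY = 86400


def classify_token(expiry_ts: float, now: float) -> str:
    """Mirror the PromQL thresholds: <7 days is critical, <30 days is warning."""
    remaining = expiry_ts - now  # same subtraction as the alert expressions
    if remaining < 7 * SECONDS_PER_DAY:
        return "critical"
    if remaining < 30 * SECONDS_PER_DAY:
        return "warning"
    return "ok"
```

In Prometheus both rules would fire inside the 7-day window, since the warning expression is also true there; the sketch reports only the most severe level.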
@ -1889,6 +1907,38 @@ serverFiles:
|
|||
severity: warning
|
||||
annotations:
|
||||
summary: "{{ $value | printf \"%.0f\" }} service(s) externally unreachable but internally healthy — check Cloudflare tunnel, DNS, or Traefik routing"
|
||||
- name: "Authentik Outpost"
|
||||
# Guards against the 2026-04-18 incident where /dev/shm filled with
|
||||
# gorilla/sessions FileStore files (~44k files at ~1.5KB each) and the
|
||||
# outpost returned HTTP 400 on every forward-auth request.
|
||||
# See docs/post-mortems/2026-04-18-authentik-outpost-shm-full.md.
|
||||
rules:
|
||||
- alert: AuthentikOutpostMemoryHigh
|
||||
# Working set includes /dev/shm tmpfs contents (session files).
|
||||
# sizeLimit on the outpost emptyDir is 2Gi; warn at 75% to leave
|
||||
# plenty of headroom for mitigation before ENOSPC.
|
||||
expr: container_memory_working_set_bytes{namespace="authentik", pod=~"ak-outpost-.*", container="proxy"} > 1.5 * 1024 * 1024 * 1024
|
||||
for: 15m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Authentik outpost working set {{ $value | humanize1024 }} — /dev/shm may be filling with session files (threshold 1.5 GiB of 2 GiB sizeLimit)"
|
||||
- alert: AuthentikOutpostMemoryCritical
|
||||
expr: container_memory_working_set_bytes{namespace="authentik", pod=~"ak-outpost-.*", container="proxy"} > 1.8 * 1024 * 1024 * 1024
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Authentik outpost near /dev/shm fill ({{ $value | humanize1024 }}) — imminent forward-auth failure. Restart pod: kubectl -n authentik delete pod -l goauthentik.io/outpost-name=authentik-embedded-outpost"
|
||||
- alert: AuthentikOutpostRestarts
|
||||
# Pod restarts on a stateless outpost usually mean OOM or crash.
|
||||
# Normal is 0; we expect one manual rollout per incident/upgrade.
|
||||
expr: increase(kube_pod_container_status_restarts_total{namespace="authentik", pod=~"ak-outpost-.*"}[30m]) > 2
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Authentik outpost restarted {{ $value | printf \"%.0f\" }} times in 30m — check for OOM or crash loop"
|
||||
|
||||
extraScrapeConfigs: |
|
||||
- job_name: 'proxmox-host'
|
||||
|
|
|
|||
|
|
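Reviewer note: the Authentik outpost rules in the Prometheus hunk above state that the warning threshold is 75% of the 2Gi `sizeLimit`. A small Python sketch, assuming the 2 GiB figure quoted in the rule comment, that checks both thresholds against that limit. The helper `threshold_fraction` is illustrative only.

```python
GIB = 1024 ** 3
SIZE_LIMIT_BYTES = 2 * GIB  # emptyDir sizeLimit quoted in the rule comment


def threshold_fraction(threshold_gib: float) -> float:
    """Return an alert threshold (in GiB) as a fraction of the sizeLimit."""
    return (threshold_gib * GIB) / SIZE_LIMIT_BYTES


warning_fraction = threshold_fraction(1.5)   # AuthentikOutpostMemoryHigh
critical_fraction = threshold_fraction(1.8)  # AuthentikOutpostMemoryCritical
```

Under that assumption the warning rule sits at 75% of the limit and the critical rule at roughly 90%.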
@ -47,6 +47,35 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
depends_on = [kubernetes_namespace.n8n]
|
||||
}
|
||||
|
||||
resource "kubernetes_manifest" "external_secret_claude_agent" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "claude-agent-token"
|
||||
namespace = "n8n"
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-kv"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = {
|
||||
name = "claude-agent-token"
|
||||
}
|
||||
data = [{
|
||||
secretKey = "api_bearer_token"
|
||||
remoteRef = {
|
||||
key = "claude-agent-service"
|
||||
property = "api_bearer_token"
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.n8n]
|
||||
}
|
||||
|
||||
resource "kubernetes_persistent_volume_claim" "data_encrypted" {
|
||||
wait_until_bound = false
|
||||
metadata {
|
||||
|
|
@ -207,6 +236,19 @@ resource "kubernetes_deployment" "n8n" {
|
|||
name = "WEBHOOK_URL"
|
||||
value = "https://n8n.viktorbarzin.me"
|
||||
}
|
||||
env {
|
||||
name = "CLAUDE_AGENT_API_TOKEN"
|
||||
value_from {
|
||||
secret_key_ref {
|
||||
name = "claude-agent-token"
|
||||
key = "api_bearer_token"
|
||||
}
|
||||
}
|
||||
}
|
||||
env {
|
||||
name = "N8N_BLOCK_ENV_ACCESS_IN_NODE"
|
||||
value = "false"
|
||||
}
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/home/node/.n8n"
|
||||
|
|
|
|||
|
|
@ -38,15 +38,25 @@
|
|||
},
|
||||
{
|
||||
"parameters": {
|
||||
"command": "='claude -p \"You are the service-upgrade agent. Read /home/wizard/code/infra/.claude/agents/service-upgrade.md for full instructions.\\n\\nUpgrade task:\\n- Image: ' + $json.body.diun_entry_image + '\\n- New tag: ' + $json.body.diun_entry_imagetag + '\\n- Hub link: ' + ($json.body.diun_entry_hublink || 'none') + '\\n\\nExecute the upgrade workflow now.\"'",
|
||||
"cwd": "/home/wizard/code/infra"
|
||||
"method": "POST",
|
||||
"url": "http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute",
|
||||
"sendHeaders": true,
|
||||
"headerParameters": {
|
||||
"parameters": [
|
||||
{"name": "Authorization", "value": "=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}"},
|
||||
{"name": "Content-Type", "value": "application/json"}
|
||||
]
|
||||
},
|
||||
"sendBody": true,
|
||||
"specifyBody": "json",
|
||||
"jsonBody": "={{ JSON.stringify({ prompt: 'You are the service-upgrade agent. Read .claude/agents/service-upgrade.md for full instructions.\\n\\nUpgrade task:\\n- Image: ' + $json.body.diun_entry_image + '\\n- New tag: ' + $json.body.diun_entry_imagetag + '\\n- Hub link: ' + ($json.body.diun_entry_hublink || 'none') + '\\n\\nExecute the upgrade workflow now.', agent: '.claude/agents/service-upgrade', max_budget_usd: 10, timeout_seconds: 1800 }) }}",
|
||||
"options": {}
|
||||
},
|
||||
"id": "ssh-execute",
|
||||
"id": "http-execute",
|
||||
"name": "Run Upgrade Agent",
|
||||
"type": "n8n-nodes-base.ssh",
|
||||
"typeVersion": 1,
|
||||
"position": [910, 300],
|
||||
"credentials": {"sshPassword": {"id": "REPLACE_WITH_SSH_CRED_ID", "name": "Dev VM SSH"}}
|
||||
"type": "n8n-nodes-base.httpRequest",
|
||||
"typeVersion": 4.2,
|
||||
"position": [910, 300]
|
||||
}
|
||||
],
|
||||
"connections": {
|
||||
|
|
|
|||
|
|
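Reviewer note: the workflow change above replaces the SSH node with an HTTP POST to the claude-agent-service `/execute` endpoint. A hedged Python sketch of an equivalent request, handy for testing the endpoint outside n8n. The URL, header names, and body fields are taken from the diff; the function name `build_execute_request` is hypothetical, and the response shape is not shown in the diff, so the sketch only builds the request and does not send it.

```python
import json
import urllib.request
from typing import Optional

AGENT_URL = "http://claude-agent-service.claude-agent.svc.cluster.local:8080/execute"


def build_execute_request(
    token: str, image: str, tag: str, hub_link: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) the POST that the n8n httpRequest node issues."""
    prompt = (
        "You are the service-upgrade agent. "
        "Read .claude/agents/service-upgrade.md for full instructions.\n\n"
        "Upgrade task:\n"
        f"- Image: {image}\n"
        f"- New tag: {tag}\n"
        f"- Hub link: {hub_link or 'none'}\n\n"
        "Execute the upgrade workflow now."
    )
    body = {
        "prompt": prompt,
        "agent": ".claude/agents/service-upgrade",
        "max_budget_usd": 10,
        "timeout_seconds": 1800,
    }
    return urllib.request.Request(
        AGENT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` with a timeout slightly above 1800 seconds would send it; that only resolves from inside the cluster, because the URL is a cluster-local service name.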
@ -80,6 +80,7 @@ resource "kubernetes_deployment" "novelapp" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].container[0].image,
|
||||
]
|
||||
|
|
|
|||
73
stacks/ollama/.terraform.lock.hcl
generated
|
|
@ -1,73 +0,0 @@
|
|||
# This file is maintained automatically by "terraform init".
|
||||
# Manual edits may be lost in future updates.
|
||||
|
||||
provider "registry.terraform.io/cloudflare/cloudflare" {
|
||||
version = "4.52.7"
|
||||
constraints = "~> 4.0"
|
||||
hashes = [
|
||||
"h1:pPItIWii5oymR+geZB219ROSPuSODPLTlM4S/u8xLvM=",
|
||||
"zh:0c904ce31a4c6c4a5b3bf7ff1560e77c0cc7e2450c8553ded8e8c90398e1418b",
|
||||
"zh:36183d310c36373fe4cb936b83c595c6fd3b0a94bc7827f28e5789ccbf59752e",
|
||||
"zh:556a568a6f0235e8f41647de9e4d3a1e7b1d6502df8b19b54ec441f1c653ea10",
|
||||
"zh:633ebbd5b0245e75e500ef9be4d9e62288f97e8da3baaa51323892a786d90285",
|
||||
"zh:6acfe60cf52a65ba8f044f748548d2119e7f4fd7f8ebcb14698960d87c68f529",
|
||||
"zh:890df766e9b839623b1f0437355032a3c006226a6c200cd911e15ee1a9014e9f",
|
||||
"zh:904acc31ebb9d6ef68c792074b30532ee61bf515f19e0a3c75b46f126cca1f13",
|
||||
"zh:a1d0a81246afc8750286d3f6fe7a8fbe6460dd2662407b28dbfbabb612e5fa9d",
|
||||
"zh:a41a36fe253fc365fe2b7ffc749624688b2693b4634862fda161179ab100029f",
|
||||
"zh:a7ef269e77ffa8715c8945a2c14322c7ff159ea44c15f62505f3cbb2cae3b32d",
|
||||
"zh:b01aa3bed30610633b762df64332b26f8844a68c3960cebcb30f04918efc67fe",
|
||||
"zh:b069cc2cd18cae10757df3ae030508eac8d55de7e49eda7a5e3e11f2f7fe6455",
|
||||
"zh:b2d2c6313729ebb7465dceece374049e2d08bda34473901be9ff46a8836d42b2",
|
||||
"zh:db0e114edaf4bc2f3d4769958807c83022bfbc619a00bdf4c4bd17faa4ab2d8b",
|
||||
"zh:ecc0aa8b9044f664fd2aaf8fa992d976578f78478980555b4b8f6148e8d1a5fe",
|
||||
]
|
||||
}
|
||||
|
||||
provider "registry.terraform.io/hashicorp/helm" {
|
||||
version = "3.1.1"
|
||||
hashes = [
|
||||
"h1:47CqNwkxctJtL/N/JuEj+8QMg8mRNI/NWeKO5/ydfZU=",
|
||||
"h1:5b2ojWKT0noujHiweCds37ZreRFRQLNaErdJLusJN88=",
|
||||
"zh:1a6d5ce931708aec29d1f3d9e360c2a0c35ba5a54d03eeaff0ce3ca597cd0275",
|
||||
"zh:3411919ba2a5941801e677f0fea08bdd0ae22ba3c9ce3309f55554699e06524a",
|
||||
"zh:81b36138b8f2320dc7f877b50f9e38f4bc614affe68de885d322629dd0d16a29",
|
||||
"zh:95a2a0a497a6082ee06f95b38bd0f0d6924a65722892a856cfd914c0d117f104",
|
||||
"zh:9d3e78c2d1bb46508b972210ad706dd8c8b106f8b206ecf096cd211c54f46990",
|
||||
"zh:a79139abf687387a6efdbbb04289a0a8e7eaca2bd91cdc0ce68ea4f3286c2c34",
|
||||
"zh:aaa8784be125fbd50c48d84d6e171d3fb6ef84a221dbc5165c067ce05faab4c8",
|
||||
"zh:afecd301f469975c9d8f350cc482fe656e082b6ab0f677d1a816c3c615837cc1",
|
||||
"zh:c54c22b18d48ff9053d899d178d9ffef7d9d19785d9bf310a07d648b7aac075b",
|
||||
"zh:db2eefd55aea48e73384a555c72bac3f7d428e24147bedb64e1a039398e5b903",
|
||||
"zh:ee61666a233533fd2be971091cecc01650561f1585783c381b6f6e8a390198a4",
|
||||
"zh:f569b65999264a9416862bca5cd2a6177d94ccb0424f3a4ef424428912b9cb3c",
|
||||
]
|
||||
}
|
||||
|
||||
provider "registry.terraform.io/hashicorp/kubernetes" {
|
||||
version = "3.1.0"
|
||||
hashes = [
|
||||
"h1:oodIAuFMikXNmEtil5MQgP4dfSctUBYQiGJfjbsF3NY=",
|
||||
]
|
||||
}
|
||||
|
||||
provider "registry.terraform.io/hashicorp/vault" {
|
||||
version = "4.8.0"
|
||||
constraints = "~> 4.0"
|
||||
hashes = [
|
||||
"h1:GPfhH6dr1LY0foPBDYv9bEGifx7eSwYqFcEAOWOUxLk=",
|
||||
"h1:aHqgWQhDBMeZO9iUKwJYMlh4q+xNMUlMIcjRbF4d02Y=",
|
||||
"zh:269ab13433f67684012ae7e15876532b0312f5d0d2002a9cf9febb1279ce5ea6",
|
||||
"zh:4babc95bf0c40eb85005db1dc2ca403c46be4a71dd3e409db3711a56f7a5ca0e",
|
||||
"zh:78d5eefdd9e494defcb3c68d282b8f96630502cac21d1ea161f53cfe9bb483b3",
|
||||
"zh:86e27c1c625ecc24446a11eeffc3ac319b36c2b4e51251db8579256a0dbcf136",
|
||||
"zh:a32f31da94824009e26b077374440b52098aecb93c92ff55dc3d31dd37c4ea25",
|
||||
"zh:be0a18c6c0425518bab4fbffd82078b82036a88503b5d76064de551c9f646cbf",
|
||||
"zh:be5a77fdfd36863ebeec79cd12b1d13322ffad6821d157a0b279789fa06b5937",
|
||||
"zh:be8317d142a3caad74c7d936039ae27076a1b2b8312ef5208e2871a5f525977c",
|
||||
"zh:c94a84895a3d9954b80e983eed4603330a5cdbbd8eef5b3c99278c2d1402ef3c",
|
||||
"zh:de1fb712784dd8415f011ca5346a34f87fab6046c730557615247e511dbc7d98",
|
||||
"zh:e3eafae7da550f86cae395d6660b2a0e93ec8d2b0e0e5ef982ec762e961fc952",
|
||||
"zh:ff35fb1ab6add288f0f368981e56f780b50405accd1937131cba1137999c8d83",
|
||||
]
|
||||
}
|
||||
|
|
@ -1,7 +0,0 @@
|
|||
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
|
||||
terraform {
|
||||
backend "pg" {
|
||||
conn_str = "postgres://terraform_state:SBlzGxotNUN6HH9d0S-m@10.0.20.200:5432/terraform_state?sslmode=disable"
|
||||
schema_name = "ollama"
|
||||
}
|
||||
}
|
||||
|
|
@ -1,380 +0,0 @@
|
|||
variable "tls_secret_name" {
|
||||
type = string
|
||||
sensitive = true
|
||||
}
|
||||
variable "nfs_server" { type = string }
|
||||
variable "ollama_host" { type = string }
|
||||
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "ollama-secrets"
|
||||
namespace = "ollama"
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-kv"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = {
|
||||
name = "ollama-secrets"
|
||||
}
|
||||
dataFrom = [{
|
||||
extract = {
|
||||
key = "ollama"
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.ollama]
|
||||
}
|
||||
|
||||
data "kubernetes_secret" "eso_secrets" {
|
||||
metadata {
|
||||
name = "ollama-secrets"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
}
|
||||
depends_on = [kubernetes_manifest.external_secret]
|
||||
}
|
||||
|
||||
locals {
|
||||
api_credentials = jsondecode(data.kubernetes_secret.eso_secrets.data["api_credentials"])
|
||||
}
|
||||
|
||||
|
||||
resource "kubernetes_namespace" "ollama" {
|
||||
metadata {
|
||||
name = "ollama"
|
||||
labels = {
|
||||
tier = local.tiers.gpu
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "tls_secret" {
|
||||
source = "../../modules/kubernetes/setup_tls_secret"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
tls_secret_name = var.tls_secret_name
|
||||
}
|
||||
|
||||
module "nfs_ollama_data_host" {
|
||||
source = "../../modules/kubernetes/nfs_volume"
|
||||
name = "ollama-data-host"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
nfs_server = "192.168.1.127"
|
||||
nfs_path = "/srv/nfs-ssd/ollama"
|
||||
}
|
||||
|
||||
resource "kubernetes_persistent_volume_claim" "ollama_ui_data_proxmox" {
|
||||
wait_until_bound = false
|
||||
metadata {
|
||||
name = "ollama-ui-data-proxmox"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
annotations = {
|
||||
"resize.topolvm.io/threshold" = "80%"
|
||||
"resize.topolvm.io/increase" = "100%"
|
||||
"resize.topolvm.io/storage_limit" = "5Gi"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
access_modes = ["ReadWriteOnce"]
|
||||
storage_class_name = "proxmox-lvm"
|
||||
resources {
|
||||
requests = {
|
||||
storage = "1Gi"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# resource "helm_release" "ollama" {
|
||||
# namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
# name = "ollama"
|
||||
|
||||
# repository = "https://otwld.github.io/ollama-helm/"
|
||||
# chart = "ollama"
|
||||
# atomic = true
|
||||
|
||||
# values = [templatefile("${path.module}/values.yaml", {})]
|
||||
# timeout = 2400
|
||||
# }
|
||||
|
||||
|
||||
resource "kubernetes_deployment" "ollama" {
|
||||
metadata {
|
||||
name = "ollama"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
labels = {
|
||||
app = "ollama"
|
||||
tier = local.tiers.gpu
|
||||
}
|
||||
}
|
||||
spec {
|
||||
replicas = 0 # Scaled down — low usage, saves resources + clears ExternalAccessDivergence alert
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "ollama"
|
||||
}
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "ollama"
|
||||
}
|
||||
annotations = {
|
||||
"diun.enable" = "true"
|
||||
"diun.include_tags" = "^\\d+\\.\\d+\\.\\d+$"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
node_selector = {
|
||||
"gpu" = "true"
|
||||
}
|
||||
toleration {
|
||||
key = "nvidia.com/gpu"
|
||||
value = "true"
|
||||
effect = "NoSchedule"
|
||||
}
|
||||
container {
|
||||
image = "ollama/ollama:0.6.8"
|
||||
name = "ollama"
|
||||
env {
|
||||
name = "OLLAMA_HOST"
|
||||
value = "0.0.0.0:11434"
|
||||
}
|
||||
env {
|
||||
name = "PATH"
|
||||
value = "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
|
||||
}
|
||||
env {
|
||||
name = "OLLAMA_KEEP_ALIVE"
|
||||
value = "1h"
|
||||
}
|
||||
|
||||
port {
|
||||
container_port = 11434
|
||||
}
|
||||
volume_mount {
|
||||
name = "ollama-data"
|
||||
mount_path = "/root/.ollama"
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "100m"
|
||||
memory = "256Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "256Mi"
|
||||
"nvidia.com/gpu" = "1"
|
||||
}
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "ollama-data"
|
||||
persistent_volume_claim {
|
||||
claim_name = module.nfs_ollama_data_host.claim_name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "ollama" {
|
||||
metadata {
|
||||
name = "ollama"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
labels = {
|
||||
app = "ollama"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
selector = {
|
||||
app = "ollama"
|
||||
}
|
||||
port {
|
||||
name = "http"
|
||||
port = 11434
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Allow ollama to be connected to from external apps (internal LAN only)
|
||||
module "ollama-ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
name = "ollama-server"
|
||||
service_name = "ollama"
|
||||
root_domain = "viktorbarzin.lan"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
allow_local_access_only = true
|
||||
ssl_redirect = false
|
||||
port = 11434
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "false"
|
||||
}
|
||||
}
|
||||
|
||||
# Ollama API ingress for external access (basicAuth protected)
|
||||
locals {
|
||||
ollama_api_htpasswd = join("\n", [for name, pass in local.api_credentials : "${name}:${bcrypt(pass, 10)}"])
|
||||
}
|
||||
|
||||
resource "kubernetes_secret" "ollama_api_basic_auth" {
|
||||
metadata {
|
||||
name = "ollama-api-basic-auth-secret"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
}
|
||||
|
||||
data = {
|
||||
auth = local.ollama_api_htpasswd
|
||||
}
|
||||
|
||||
type = "Opaque"
|
||||
lifecycle {
|
||||
ignore_changes = [data]
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_manifest" "ollama_api_basic_auth_middleware" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "Middleware"
|
||||
metadata = {
|
||||
name = "ollama-api-basic-auth"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
}
|
||||
spec = {
|
||||
basicAuth = {
|
||||
secret = kubernetes_secret.ollama_api_basic_auth.metadata[0].name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ollama-api-ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
dns_type = "non-proxied"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
name = "ollama-api"
|
||||
service_name = "ollama"
|
||||
root_domain = "viktorbarzin.me"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
ssl_redirect = true
|
||||
port = 11434
|
||||
extra_annotations = {
|
||||
"traefik.ingress.kubernetes.io/router.middlewares" = "ollama-ollama-api-basic-auth@kubernetescrd,traefik-rate-limit@kubernetescrd,traefik-crowdsec@kubernetescrd"
|
||||
"gethomepage.dev/enabled" = "false"
|
||||
}
|
||||
}
|
||||
|
||||
# Web UI
|
||||
resource "kubernetes_deployment" "ollama-ui" {
|
||||
metadata {
|
||||
name = "ollama-ui"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
labels = {
|
||||
app = "ollama-ui"
|
||||
tier = local.tiers.gpu
|
||||
}
|
||||
}
|
||||
spec {
|
||||
# Disabled: reduce cluster memory pressure (2026-03-14 OOM incident)
|
||||
replicas = 0
|
||||
strategy {
|
||||
type = "Recreate"
|
||||
}
|
||||
selector {
|
||||
match_labels = {
|
||||
app = "ollama-ui"
|
||||
}
|
||||
}
|
||||
template {
|
||||
metadata {
|
||||
labels = {
|
||||
app = "ollama-ui"
|
||||
}
|
||||
annotations = {
|
||||
"dependency.kyverno.io/wait-for" = "ollama.ollama:11434"
|
||||
}
|
||||
}
|
||||
spec {
|
||||
container {
|
||||
# image = "ghcr.io/open-webui/open-webui:main"
|
||||
image = "ghcr.io/open-webui/open-webui:v0.8.12"
|
||||
name = "ollama-ui"
|
||||
env {
|
||||
name = "OLLAMA_BASE_URL"
|
||||
value = "http://${var.ollama_host}:11434"
|
||||
}
|
||||
|
||||
port {
|
||||
container_port = 8080
|
||||
}
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/app/backend/data"
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "25m"
|
||||
memory = "256Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "256Mi"
|
||||
}
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "data"
|
||||
persistent_volume_claim {
|
||||
claim_name = kubernetes_persistent_volume_claim.ollama_ui_data_proxmox.metadata[0].name
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_service" "ollama-ui" {
|
||||
metadata {
|
||||
name = "ollama-ui"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
labels = {
|
||||
app = "dashy"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
selector = {
|
||||
app = "ollama-ui"
|
||||
}
|
||||
port {
|
||||
name = "http"
|
||||
port = 80
|
||||
target_port = 8080
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
module "ingress" {
|
||||
source = "../../modules/kubernetes/ingress_factory"
|
||||
dns_type = "proxied"
|
||||
namespace = kubernetes_namespace.ollama.metadata[0].name
|
||||
name = "ollama"
|
||||
service_name = "ollama-ui"
|
||||
tls_secret_name = var.tls_secret_name
|
||||
port = 80
|
||||
extra_annotations = {
|
||||
"gethomepage.dev/enabled" = "true"
|
||||
"gethomepage.dev/name" = "Ollama"
|
||||
"gethomepage.dev/description" = "Local LLM inference"
|
||||
"gethomepage.dev/icon" = "ollama.png"
|
||||
"gethomepage.dev/group" = "AI & Data"
|
||||
"gethomepage.dev/pod-selector" = ""
|
||||
}
|
||||
}
|
||||
|
|
@ -1,33 +0,0 @@
|
|||
# Generated by Terragrunt. Sig: nIlQXj57tbuaRZEa
|
||||
terraform {
|
||||
required_providers {
|
||||
vault = {
|
||||
source = "hashicorp/vault"
|
||||
version = "~> 4.0"
|
||||
}
|
||||
cloudflare = {
|
||||
source = "cloudflare/cloudflare"
|
||||
version = "~> 4"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
variable "kube_config_path" {
|
||||
type = string
|
||||
default = "~/.kube/config"
|
||||
}
|
||||
|
||||
provider "kubernetes" {
|
||||
config_path = var.kube_config_path
|
||||
}
|
||||
|
||||
provider "helm" {
|
||||
kubernetes = {
|
||||
config_path = var.kube_config_path
|
||||
}
|
||||
}
|
||||
|
||||
provider "vault" {
|
||||
address = "https://vault.viktorbarzin.me"
|
||||
skip_child_token = true
|
||||
}
|
||||
|
|
@ -1 +0,0 @@
|
|||
../../secrets
|
||||
|
|
@ -1,28 +0,0 @@
|
|||
ollama:
|
||||
gpu:
|
||||
# -- Enable GPU integration
|
||||
enabled: true
|
||||
|
||||
# -- GPU type: 'nvidia' or 'amd'
|
||||
type: "nvidia"
|
||||
|
||||
# -- Specify the number of GPU to 1
|
||||
number: 1
|
||||
|
||||
# -- List of models to pull at container startup
|
||||
models:
|
||||
pull:
|
||||
- llama3
|
||||
|
||||
persistentVolume:
|
||||
enabled: true
|
||||
existingClaim: "ollama-pvc"
|
||||
|
||||
nodeSelector:
|
||||
gpu: "true"
|
||||
|
||||
tolerations:
|
||||
- key: "nvidia.com/gpu"
|
||||
operator: "Equal"
|
||||
value: "true"
|
||||
effect: "NoSchedule"
|
||||
|
|
@ -1175,7 +1175,7 @@ resource "kubernetes_deployment" "openlobster" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -77,6 +77,7 @@ resource "kubernetes_secret" "basic_auth" {
|
|||
|
||||
type = "Opaque"
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: htpasswd bcrypt hashes are non-deterministic per apply; would cause perpetual diff. Reviewed 2026-04-18.
|
||||
ignore_changes = [data]
|
||||
}
|
||||
}
|
||||
|
|
|
|||
330
stacks/payslip-ingest/main.tf
Normal file
|
|
@ -0,0 +1,330 @@
|
|||
variable "image_tag" {
|
||||
type = string
|
||||
default = "latest"
|
||||
description = "payslip-ingest image tag. Use 8-char git SHA in CI; :latest only for local trials."
|
||||
}
|
||||
|
||||
variable "postgresql_host" { type = string }
|
||||
|
||||
locals {
|
||||
namespace = "payslip-ingest"
|
||||
image = "registry.viktorbarzin.me/payslip-ingest:${var.image_tag}"
|
||||
labels = {
|
||||
app = "payslip-ingest"
|
||||
}
|
||||
}
|
||||
|
||||
resource "kubernetes_namespace" "payslip_ingest" {
|
||||
metadata {
|
||||
name = local.namespace
|
||||
labels = {
|
||||
tier = local.tiers.aux
|
||||
"istio-injection" = "disabled"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# App secrets sourced from multiple Vault KV keys.
|
||||
# Seed these manually in Vault before applying:
|
||||
# secret/paperless-ngx -> property `api_token`
|
||||
# secret/claude-agent-service -> property `api_bearer_token`
|
||||
# secret/payslip-ingest -> property `webhook_bearer_token`
|
||||
resource "kubernetes_manifest" "external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "payslip-ingest-secrets"
|
||||
namespace = local.namespace
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-kv"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = {
|
||||
name = "payslip-ingest-secrets"
|
||||
template = {
|
||||
metadata = {
|
||||
annotations = {
|
||||
"reloader.stakater.com/match" = "true"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
data = [
|
||||
{
|
||||
secretKey = "PAPERLESS_API_TOKEN"
|
||||
remoteRef = {
|
||||
key = "paperless-ngx"
|
||||
property = "api_token"
|
||||
}
|
||||
},
|
||||
{
|
||||
secretKey = "CLAUDE_AGENT_BEARER_TOKEN"
|
||||
remoteRef = {
|
||||
key = "claude-agent-service"
|
||||
property = "api_bearer_token"
|
||||
}
|
||||
},
|
||||
{
|
||||
secretKey = "WEBHOOK_BEARER_TOKEN"
|
||||
remoteRef = {
|
||||
key = "payslip-ingest"
|
||||
property = "webhook_bearer_token"
|
||||
}
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.payslip_ingest]
|
||||
}
|
||||
|
||||
# DB credentials from Vault database engine (rotated every 7 days).
|
||||
# Template builds the asyncpg DSN consumed by the FastAPI app as DB_CONNECTION_STRING.
|
||||
resource "kubernetes_manifest" "db_external_secret" {
|
||||
manifest = {
|
||||
apiVersion = "external-secrets.io/v1beta1"
|
||||
kind = "ExternalSecret"
|
||||
metadata = {
|
||||
name = "payslip-ingest-db-creds"
|
||||
namespace = local.namespace
|
||||
}
|
||||
spec = {
|
||||
refreshInterval = "15m"
|
||||
secretStoreRef = {
|
||||
name = "vault-database"
|
||||
kind = "ClusterSecretStore"
|
||||
}
|
||||
target = {
|
||||
name = "payslip-ingest-db-creds"
|
||||
template = {
|
||||
metadata = {
|
||||
annotations = {
|
||||
"reloader.stakater.com/match" = "true"
|
||||
}
|
||||
}
|
||||
data = {
|
||||
DB_CONNECTION_STRING = "postgresql+asyncpg://payslip_ingest:{{ .password }}@${var.postgresql_host}:5432/payslip_ingest"
|
||||
DB_PASSWORD = "{{ .password }}"
|
||||
}
|
||||
}
|
||||
}
|
||||
data = [{
|
||||
secretKey = "password"
|
||||
remoteRef = {
|
||||
key = "static-creds/pg-payslip-ingest"
|
||||
property = "password"
|
||||
}
|
||||
}]
|
||||
}
|
||||
}
|
||||
depends_on = [kubernetes_namespace.payslip_ingest]
|
||||
}
|
||||
|
||||
resource "kubernetes_deployment" "payslip_ingest" {
|
||||
metadata {
|
||||
name = "payslip-ingest"
|
||||
namespace = kubernetes_namespace.payslip_ingest.metadata[0].name
|
||||
labels = merge(local.labels, {
|
||||
tier = local.tiers.aux
|
||||
})
|
||||
annotations = {
|
||||
"reloader.stakater.com/search" = "true"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
replicas = 1
|
||||
strategy {
|
||||
type = "Recreate"
|
||||
}
|
||||
|
||||
selector {
|
||||
match_labels = local.labels
|
||||
}
|
||||
|
||||
template {
|
||||
metadata {
|
||||
labels = local.labels
|
||||
annotations = {
|
||||
"dependency.kyverno.io/wait-for" = "postgresql.dbaas:5432"
|
||||
}
|
||||
}
|
||||
|
||||
spec {
|
||||
image_pull_secrets {
|
||||
name = "registry-credentials"
|
||||
}
|
||||
|
||||
init_container {
|
||||
name = "alembic-migrate"
|
||||
image = local.image
|
||||
command = ["python", "-m", "payslip_ingest", "migrate"]
|
||||
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "payslip-ingest-secrets"
|
||||
}
|
||||
}
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "payslip-ingest-db-creds"
|
||||
}
|
||||
}
|
||||
|
||||
env {
|
||||
name = "PAPERLESS_URL"
|
||||
value = "http://paperless-ngx.paperless-ngx.svc.cluster.local"
|
||||
}
|
||||
env {
|
||||
name = "CLAUDE_AGENT_URL"
|
||||
value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
|
||||
}
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "50m"
|
||||
memory = "256Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "512Mi"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
container {
|
||||
name = "payslip-ingest"
|
||||
image = local.image
|
||||
|
||||
port {
|
||||
container_port = 8080
|
||||
}
|
||||
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "payslip-ingest-secrets"
|
||||
}
|
||||
}
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = "payslip-ingest-db-creds"
|
||||
}
|
||||
}
|
||||
|
||||
env {
|
||||
name = "PAPERLESS_URL"
|
||||
value = "http://paperless-ngx.paperless-ngx.svc.cluster.local"
|
||||
}
|
||||
env {
|
||||
name = "CLAUDE_AGENT_URL"
|
||||
value = "http://claude-agent-service.claude-agent.svc.cluster.local:8080"
|
||||
}
|
||||
|
||||
readiness_probe {
|
||||
http_get {
|
||||
path = "/healthz"
|
||||
port = 8080
|
||||
}
|
||||
initial_delay_seconds = 5
|
||||
period_seconds = 10
|
||||
}
|
||||
|
||||
liveness_probe {
|
||||
http_get {
|
||||
path = "/healthz"
|
||||
port = 8080
|
||||
}
|
||||
initial_delay_seconds = 5
|
||||
period_seconds = 10
|
||||
}
|
||||
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "50m"
|
||||
memory = "256Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "512Mi"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
|
||||
depends_on = [
|
||||
kubernetes_manifest.external_secret,
|
||||
kubernetes_manifest.db_external_secret,
|
||||
]
|
||||
}
|
||||
|
||||
# ClusterIP-only — webhook is cluster-internal (paperless-ngx -> payslip-ingest).
|
||||
resource "kubernetes_service" "payslip_ingest" {
|
||||
metadata {
|
||||
name = "payslip-ingest"
|
||||
namespace = kubernetes_namespace.payslip_ingest.metadata[0].name
|
||||
labels = local.labels
|
||||
}
|
||||
|
||||
spec {
|
||||
type = "ClusterIP"
|
||||
selector = local.labels
|
||||
|
||||
port {
|
||||
name = "http"
|
||||
port = 8080
|
||||
target_port = 8080
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Plan-time read of the ESO-created K8s Secret for Grafana datasource password.
|
||||
# On the first apply, run with -target=kubernetes_manifest.db_external_secret so the Secret exists before this data source is read.
|
||||
data "kubernetes_secret" "payslip_ingest_db_creds" {
|
||||
metadata {
|
||||
name = "payslip-ingest-db-creds"
|
||||
namespace = kubernetes_namespace.payslip_ingest.metadata[0].name
|
||||
}
|
||||
depends_on = [kubernetes_manifest.db_external_secret]
|
||||
}
|
||||
|
||||
# Grafana datasource for payslip_ingest PostgreSQL DB.
|
||||
# Lives in the monitoring namespace so the grafana sidecar (label grafana_datasource=1) picks it up.
|
||||
resource "kubernetes_config_map" "grafana_payslips_datasource" {
|
||||
metadata {
|
||||
name = "grafana-payslips-datasource"
|
||||
namespace = "monitoring"
|
||||
labels = {
|
||||
grafana_datasource = "1"
|
||||
}
|
||||
}
|
||||
data = {
|
||||
"payslips-datasource.yaml" = yamlencode({
|
||||
apiVersion = 1
|
||||
datasources = [{
|
||||
name = "Payslips"
|
||||
type = "postgres"
|
||||
access = "proxy"
|
||||
url = "${var.postgresql_host}:5432"
|
||||
database = "payslip_ingest"
|
||||
user = "payslip_ingest"
|
||||
uid = "payslips-pg"
|
||||
jsonData = {
|
||||
sslmode = "disable"
|
||||
postgresVersion = 1600
|
||||
timescaledb = false
|
||||
}
|
||||
secureJsonData = {
|
||||
password = data.kubernetes_secret.payslip_ingest_db_creds.data["DB_PASSWORD"]
|
||||
}
|
||||
editable = true
|
||||
}]
|
||||
})
|
||||
}
|
||||
}
|
||||
|
|
@ -11,3 +11,8 @@ dependency "vault" {
|
|||
config_path = "../vault"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
||||
dependency "external-secrets" {
|
||||
config_path = "../external-secrets"
|
||||
skip_outputs = true
|
||||
}
|
||||
|
|
@ -197,7 +197,7 @@ resource "kubernetes_deployment" "phpipam_web" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -80,6 +80,7 @@ resource "kubernetes_deployment" "plotting-book" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: CI pipeline owns image tag (kubectl set image from Woodpecker/GHA). Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].container[0].image,
|
||||
]
|
||||
|
|
|
|||
|
|
@ -89,7 +89,7 @@ resource "kubernetes_deployment" "priority-pass" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
76
stacks/rybbit/worker/index.js
Normal file
|
|
@ -0,0 +1,76 @@
|
|||
// Rybbit analytics injection via Cloudflare Worker
|
||||
// Injects the rybbit tracking script into HTML responses using HTMLRewriter.
|
||||
// Deployed as a route-based worker on *.viktorbarzin.me/*
|
||||
|
||||
// Site ID mapping: hostname → rybbit site ID
|
||||
// These were previously injected via Traefik's rewrite-body plugin (broken on v3.6).
|
||||
const SITE_IDS = {
|
||||
"viktorbarzin.me": "da853a2438d0",
|
||||
"www.viktorbarzin.me": "da853a2438d0",
|
||||
"actualbudget.viktorbarzin.me": "3e6b6b68088a",
|
||||
"crowdsec.viktorbarzin.me": "d09137795ccc",
|
||||
"cyberchef.viktorbarzin.me": "7c460afc68c4",
|
||||
"dawarich.viktorbarzin.me": "0abfd409f2fb",
|
||||
"pma.viktorbarzin.me": "942c76b8bd4d",
|
||||
"pgadmin.viktorbarzin.me": "7cef78e30485",
|
||||
"audiobookshelf.viktorbarzin.me": "17a5c7fbb077",
|
||||
"calibre.viktorbarzin.me": "ce5f8aed6bbb",
|
||||
"stacks.viktorbarzin.me": "b38fda4285df",
|
||||
"f1.viktorbarzin.me": "7e69786f66d5",
|
||||
"frigate.viktorbarzin.me": "0d4044069ff5",
|
||||
"highlights-immich.viktorbarzin.me": "602167601c6b",
|
||||
"immich.viktorbarzin.me": "35eedb7a3d2b",
|
||||
"mail.viktorbarzin.me": "082f164faa7d",
|
||||
"navidrome.viktorbarzin.me": "8a3844ff75ba",
|
||||
"networking-toolbox.viktorbarzin.me": "50e38577e41c",
|
||||
"nextcloud.viktorbarzin.me": "5a3bfe59a3fe",
|
||||
"paperless-ngx.viktorbarzin.me": "be6d140cbed8",
|
||||
"privatebin.viktorbarzin.me": "3ae810b0476d",
|
||||
"wrongmove.viktorbarzin.me": "edee05de453d",
|
||||
"rybbit.viktorbarzin.me": "3c476801a777",
|
||||
"send.viktorbarzin.me": "c1b8f8aa831b",
|
||||
"stirling-pdf.viktorbarzin.me": "a55ac54ec749",
|
||||
"uptime-kuma.viktorbarzin.me": "8fef77b1f7fe",
|
||||
"vaultwarden.viktorbarzin.me": "b8fc85e18683",
|
||||
};
|
||||
|
||||
// Default site ID for any proxied host not in the map above.
|
||||
// Set to null to skip injection for unmapped hosts.
|
||||
const DEFAULT_SITE_ID = null;
|
||||
|
||||
class HeadInjector {
|
||||
constructor(siteId) {
|
||||
this.siteId = siteId;
|
||||
}
|
||||
|
||||
element(element) {
|
||||
element.prepend(
|
||||
`<script src="https://rybbit.viktorbarzin.me/api/script.js" data-site-id="${this.siteId}" defer></script>`,
|
||||
{ html: true }
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
export default {
|
||||
async fetch(request) {
|
||||
const url = new URL(request.url);
|
||||
const hostname = url.hostname;
|
||||
|
||||
// Look up site ID for this hostname
|
||||
const siteId = SITE_IDS[hostname] || DEFAULT_SITE_ID;
|
||||
|
||||
// Fetch the origin response
|
||||
const response = await fetch(request);
|
||||
|
||||
// Only inject into HTML responses that have a site ID
|
||||
const contentType = response.headers.get("content-type") || "";
|
||||
if (!siteId || !contentType.includes("text/html")) {
|
||||
return response;
|
||||
}
|
||||
|
||||
// Use HTMLRewriter to inject the script at the start of <head> (element.prepend)
|
||||
return new HTMLRewriter()
|
||||
.on("head", new HeadInjector(siteId))
|
||||
.transform(response);
|
||||
},
|
||||
};
|
||||
44
stacks/rybbit/worker/wrangler.toml
Normal file
|
|
@ -0,0 +1,44 @@
|
|||
name = "rybbit-analytics"
|
||||
main = "index.js"
|
||||
compatibility_date = "2024-01-01"
|
||||
|
||||
# Explicit per-host routes. Replaces the previous wildcard routes which hit all
|
||||
# ~119 proxied *.viktorbarzin.me hosts and burned the Workers free-tier quota
|
||||
# (100k/day). Only hostnames present in SITE_IDS (index.js) get a route; all
|
||||
# other proxied hosts bypass the Worker entirely.
|
||||
#
|
||||
# rybbit.viktorbarzin.me is INTENTIONALLY EXCLUDED even though it has a site ID
|
||||
# in index.js — it serves the tracker JS (/api/script.js) and event POSTs
|
||||
# (/api/track), which are JSON/JS not HTML, so the content-type guard in
|
||||
# index.js would no-op anyway. Leaving it routed would make every Rybbit event
|
||||
# spawn a Worker invocation for no value (self-amplification).
|
||||
#
|
||||
# When adding a new site to SITE_IDS in index.js, also add its route here.
|
||||
routes = [
|
||||
{ pattern = "viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "www.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "actualbudget.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "crowdsec.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "cyberchef.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "dawarich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "pma.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "pgadmin.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "audiobookshelf.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "calibre.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "stacks.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "f1.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "frigate.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "highlights-immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "immich.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "mail.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "navidrome.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "networking-toolbox.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "nextcloud.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "paperless-ngx.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "privatebin.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "wrongmove.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "send.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "stirling-pdf.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "uptime-kuma.viktorbarzin.me/*", zone_name = "viktorbarzin.me" },
|
||||
{ pattern = "vaultwarden.viktorbarzin.me/*", zone_name = "viktorbarzin.me" }
|
||||
]
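The comment block in wrangler.toml asks that every hostname in SITE_IDS (index.js) also has a route, with rybbit.viktorbarzin.me deliberately left out. A minimal sketch of how that invariant could be checked is shown below. The helper names `route_hosts` and `check_routes` are hypothetical and not part of the repository, and the sample data is abbreviated from the two files.

```python
# Hypothetical helper, not part of the repository: checks that the Worker
# routes and the SITE_IDS map stay in sync.


def route_hosts(patterns):
    """Strip the trailing '/*' from each route pattern to get its hostname."""
    return {p[:-2] if p.endswith("/*") else p for p in patterns}


def check_routes(site_ids, patterns, excluded=frozenset()):
    """Return (hosts that lack a route, routes that have no site ID)."""
    hosts = route_hosts(patterns)
    expected = set(site_ids) - set(excluded)
    missing_routes = sorted(expected - hosts)
    orphan_routes = sorted(hosts - set(site_ids))
    return missing_routes, orphan_routes


# Abbreviated sample data; the full lists live in index.js and wrangler.toml.
SAMPLE_SITE_IDS = {
    "viktorbarzin.me": "da853a2438d0",
    "immich.viktorbarzin.me": "35eedb7a3d2b",
    "rybbit.viktorbarzin.me": "3c476801a777",
}
SAMPLE_PATTERNS = ["viktorbarzin.me/*", "immich.viktorbarzin.me/*"]

missing, orphans = check_routes(
    SAMPLE_SITE_IDS, SAMPLE_PATTERNS, excluded={"rybbit.viktorbarzin.me"}
)
print("missing routes:", missing)
print("routes without a site ID:", orphans)
```

A route without a site ID is not a functional bug, since the Worker returns the origin response untouched, but it still consumes one invocation per request, which is the quota problem the explicit routes were introduced to avoid.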
|
||||
|
|
@ -611,6 +611,6 @@ PYEOF
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -10,7 +10,6 @@ variable "tls_secret_name" {
|
|||
variable "nfs_server" { type = string }
|
||||
variable "postgresql_host" { type = string }
|
||||
variable "redis_host" { type = string }
|
||||
variable "ollama_host" { type = string }
|
||||
locals {
|
||||
common_env = {
|
||||
TRADING_REDIS_URL = "redis://${var.redis_host}:6379/4"
|
||||
|
|
@ -18,8 +17,6 @@ locals {
|
|||
TRADING_ALPACA_BASE_URL = "https://paper-api.alpaca.markets"
|
||||
TRADING_PAPER_TRADING = "true"
|
||||
TRADING_REDDIT_USER_AGENT = "trading-bot/0.1"
|
||||
TRADING_OLLAMA_HOST = "http://${var.ollama_host}:11434"
|
||||
TRADING_OLLAMA_MODEL = "gemma3"
|
||||
TRADING_WATCHLIST = "[\"AAPL\",\"TSLA\",\"NVDA\",\"MSFT\",\"GOOGL\"]"
|
||||
TRADING_BAR_TIMEFRAME = "5Min"
|
||||
TRADING_POLL_INTERVAL_SECONDS = "60"
|
||||
|
|
@ -317,6 +314,7 @@ resource "kubernetes_deployment" "trading-bot-frontend" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: CI pipeline owns image tags for api + migrations containers. Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].container[0].image,
|
||||
spec[0].template[0].spec[0].container[1].image,
|
||||
|
|
@ -578,6 +576,7 @@ resource "kubernetes_deployment" "trading-bot-workers" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
# DRIFT_WORKAROUND: CI pipeline owns image tags for all 6 worker containers. Reviewed 2026-04-18.
|
||||
ignore_changes = [
|
||||
spec[0].template[0].spec[0].container[0].image,
|
||||
spec[0].template[0].spec[0].container[1].image,
|
||||
|
|
|
|||
|
|
@ -89,7 +89,7 @@ resource "kubernetes_deployment" "error_pages" {
|
|||
}
|
||||
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -141,12 +141,6 @@ resource "helm_release" "traefik" {
|
|||
protocol = "TCP"
|
||||
expose = { default = true }
|
||||
}
|
||||
ollama-tcp = {
|
||||
port = 11434
|
||||
exposedPort = 11434
|
||||
protocol = "TCP"
|
||||
expose = { default = true }
|
||||
}
|
||||
}
|
||||
|
||||
service = {
|
||||
|
|
|
|||
|
|
@ -101,8 +101,8 @@ resource "kubernetes_deployment" "uptime-kuma" {
|
|||
|
||||
resources {
|
||||
requests = {
|
||||
cpu = "50m"
|
||||
memory = "64Mi"
|
||||
cpu = "100m"
|
||||
memory = "128Mi"
|
||||
}
|
||||
limits = {
|
||||
memory = "512Mi"
|
||||
|
|
@ -339,6 +339,12 @@ FALLBACK_FILE = "/config/targets.json"
|
|||
PREFIX = "[External] "
|
||||
ANNOTATION_ENABLE = "uptime.viktorbarzin.me/external-monitor"
|
||||
ANNOTATION_NAME = "uptime.viktorbarzin.me/external-monitor-name"
|
||||
ANNOTATION_PATH = "uptime.viktorbarzin.me/external-monitor-path"
|
||||
DEFAULT_PATH = "/"
|
||||
# Homepages often serve 200/30x/40x even when backends are degraded.
|
||||
# When an explicit probe path is set we expect a real healthz: tighten codes.
|
||||
STATUSCODES_LENIENT = ["200-299", "300-399", "400-499"]
|
||||
STATUSCODES_STRICT = ["200-299"]
|
||||
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"
|
||||
API_SERVER = f"https://{os.environ.get('KUBERNETES_SERVICE_HOST', 'kubernetes.default.svc.cluster.local')}:{os.environ.get('KUBERNETES_SERVICE_PORT', '443')}"
|
||||
|
||||
|
|
@ -373,11 +379,19 @@ def load_from_api():
|
|||
host = rules[0].get("host")
|
||||
if not host or not host.endswith(".viktorbarzin.me"):
|
||||
continue # skip internal-only or non-public hosts
|
||||
if host in seen:
|
||||
continue
|
||||
seen.add(host)
|
||||
label = anns.get(ANNOTATION_NAME) or host.split(".")[0]
|
||||
targets.append({"name": label, "url": f"https://{host}"})
|
||||
monitor_name = f"{PREFIX}{label}"
|
||||
if monitor_name in seen:
|
||||
continue # dedupe by final monitor name, not hostname (fixes duplicate creation)
|
||||
seen.add(monitor_name)
|
||||
path = anns.get(ANNOTATION_PATH, "").strip()
|
||||
if path and not path.startswith("/"):
|
||||
path = "/" + path
|
||||
# Omit trailing slash when no explicit path — matches pre-existing monitor URLs
|
||||
# and avoids every sync re-updating unchanged monitors.
|
||||
url = f"https://{host}{path}" if path else f"https://{host}"
|
||||
statuscodes = STATUSCODES_STRICT if path else STATUSCODES_LENIENT
|
||||
targets.append({"name": label, "url": url, "statuscodes": statuscodes})
|
||||
return targets
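The path and status-code handling added to load_from_api can be read in isolation as a small pure function. The sketch below reuses the constants from the hunk; the function name `build_target` is hypothetical and only illustrates the rule, it is not how the script is actually structured.

```python
STATUSCODES_LENIENT = ["200-299", "300-399", "400-499"]
STATUSCODES_STRICT = ["200-299"]


def build_target(host, label, raw_path=""):
    """Build one monitor target from an ingress host and optional probe path."""
    path = raw_path.strip()
    if path and not path.startswith("/"):
        path = "/" + path  # tolerate annotations written without a leading slash
    # No explicit path: keep the bare URL (no trailing slash) and accept
    # 2xx/3xx/4xx. Explicit path: treat it as a health endpoint, 2xx only.
    url = f"https://{host}{path}" if path else f"https://{host}"
    statuscodes = STATUSCODES_STRICT if path else STATUSCODES_LENIENT
    return {"name": label, "url": url, "statuscodes": statuscodes}


print(build_target("immich.viktorbarzin.me", "immich"))
print(build_target("immich.viktorbarzin.me", "immich", "healthz"))
```

Keeping the bare URL free of a trailing slash matters for the later update pass: it compares the stored URL with the computed one by string equality, so any formatting difference would trigger an edit on every sync.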
|
||||
|
||||
|
||||
|
|
@ -385,7 +399,7 @@ def load_from_configmap():
|
|||
"""Legacy fallback: read the ConfigMap list."""
|
||||
with open(FALLBACK_FILE) as f:
|
||||
raw = json.load(f)
|
||||
return [{"name": t["name"], "url": t["url"]} for t in raw]
|
||||
return [{"name": t["name"], "url": t["url"], "statuscodes": STATUSCODES_LENIENT} for t in raw]
|
||||
|
||||
|
||||
try:
|
||||
|
|
@ -412,10 +426,12 @@ for m in monitors:
|
|||
existing_external[m["name"]] = m
|
||||
|
||||
target_names = set()
|
||||
targets_by_name = {}
|
||||
created = 0
|
||||
for t in targets:
|
||||
monitor_name = f"{PREFIX}{t['name']}"
|
||||
target_names.add(monitor_name)
|
||||
targets_by_name[monitor_name] = t
|
||||
if monitor_name not in existing_external:
|
||||
print(f"Creating monitor: {monitor_name} -> {t['url']}")
|
||||
api.add_monitor(
|
||||
|
|
@ -424,11 +440,31 @@ for t in targets:
|
|||
url=t["url"],
|
||||
interval=300,
|
||||
maxretries=3,
|
||||
accepted_statuscodes=["200-299", "300-399", "400-499"],
|
||||
accepted_statuscodes=t["statuscodes"],
|
||||
)
|
||||
created += 1
|
||||
time.sleep(0.3)
|
||||
|
||||
# Update monitors whose target URL or accepted status codes drifted
|
||||
# (e.g., new probe-path annotation added on an existing ingress).
|
||||
updated = 0
|
||||
for monitor_name, t in targets_by_name.items():
|
||||
existing = existing_external.get(monitor_name)
|
||||
if not existing:
|
||||
continue
|
||||
current_url = existing.get("url")
|
||||
current_codes = existing.get("accepted_statuscodes") or []
|
||||
if current_url == t["url"] and current_codes == t["statuscodes"]:
|
||||
continue
|
||||
print(f"Updating monitor {monitor_name}: {current_url} -> {t['url']} (codes {current_codes} -> {t['statuscodes']})")
|
||||
api.edit_monitor(
|
||||
existing["id"],
|
||||
url=t["url"],
|
||||
accepted_statuscodes=t["statuscodes"],
|
||||
)
|
||||
updated += 1
|
||||
time.sleep(0.3)
|
||||
|
||||
# Remove monitors for services no longer in the list
|
||||
deleted = 0
|
||||
for name, m in existing_external.items():
|
||||
|
|
@ -439,7 +475,8 @@ for name, m in existing_external.items():
|
|||
time.sleep(0.3)
|
||||
|
||||
api.disconnect()
|
||||
print(f"Sync complete: {created} created, {deleted} deleted, {len(target_names) - created} unchanged")
|
||||
unchanged = len(target_names) - created - updated
|
||||
print(f"Sync complete: {created} created, {updated} updated, {deleted} deleted, {unchanged} unchanged")
|
||||
PYEOF
|
||||
EOT
|
||||
]
|
||||
|
|
@ -480,6 +517,196 @@ PYEOF
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
# =============================================================================
|
||||
# Internal Monitor Sync
|
||||
# Declaratively manages monitors for internal services (databases, non-HTTP
|
||||
# endpoints) that can't be discovered from ingress annotations. Idempotent:
|
||||
# looks up monitors by name, creates if missing, patches if drifted.
|
||||
#
|
||||
# Why a CronJob and not a one-shot Job:
|
||||
# - louislam/uptime-kuma has no Terraform provider (only a CLI tool).
|
||||
# - UK v2 stores monitors in MariaDB (`uptimekuma` on mysql.dbaas); if the DB
|
||||
# is wiped/restored we must re-create them.
|
||||
# - CronJob self-heals drift (manual UI edits, UK restarts, DB restores).
|
||||
#
|
||||
# Managed monitors (name -> desired spec) are defined in local.internal_monitors
|
||||
# below. Add new internal-service monitors there.
|
||||
# =============================================================================
|
||||
|
||||
locals {
|
||||
internal_monitors = [
|
||||
{
|
||||
name = "MySQL Standalone (dbaas)"
|
||||
type = "mysql"
|
||||
database_connection_string = "mysql://uptimekuma@mysql.dbaas.svc.cluster.local:3306"
|
||||
database_password_vault_key = "uptimekuma_db_password"
|
||||
interval = 60
|
||||
retry_interval = 60
|
||||
max_retries = 2
|
||||
},
|
||||
]
|
||||
}
|
||||
|
||||
resource "kubernetes_secret" "internal_monitor_sync" {
|
||||
metadata {
|
||||
name = "internal-monitor-sync"
|
||||
namespace = kubernetes_namespace.uptime-kuma.metadata[0].name
|
||||
}
|
||||
data = merge(
|
||||
{ UPTIME_KUMA_PASSWORD = data.vault_kv_secret_v2.viktor.data["uptime_kuma_admin_password"] },
|
||||
{
|
||||
for m in local.internal_monitors :
|
||||
"DB_PASSWORD_${upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))}" =>
|
||||
data.vault_kv_secret_v2.viktor.data[m.database_password_vault_key]
|
||||
},
|
||||
)
|
||||
}
|
||||
|
||||
resource "kubernetes_config_map_v1" "internal_monitor_targets" {
|
||||
metadata {
|
||||
name = "internal-monitor-targets"
|
||||
namespace = kubernetes_namespace.uptime-kuma.metadata[0].name
|
||||
}
|
||||
data = {
|
||||
"targets.json" = jsonencode([
|
||||
for m in local.internal_monitors : {
|
||||
name = m.name
|
||||
type = m.type
|
||||
database_connection_string = m.database_connection_string
|
||||
password_env = "DB_PASSWORD_${upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))}"
|
||||
interval = m.interval
|
||||
retry_interval = m.retry_interval
|
||||
max_retries = m.max_retries
|
||||
}
|
||||
])
|
||||
}
|
||||
}
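Both the Secret and the ConfigMap above derive the password variable name with the same Terraform expression, `upper(replace(m.name, "/[^A-Za-z0-9]/", "_"))`. As a rough illustration of what that yields, the following Python sketch mirrors the expression; `password_env_name` is a hypothetical name used only here.

```python
import re


def password_env_name(monitor_name):
    """Mirror the Terraform expression: every non-alphanumeric character
    becomes '_', then the result is upper-cased and prefixed."""
    return "DB_PASSWORD_" + re.sub(r"[^A-Za-z0-9]", "_", monitor_name).upper()


print(password_env_name("MySQL Standalone (dbaas)"))
# prints DB_PASSWORD_MYSQL_STANDALONE__DBAAS_
```

Because each character is replaced individually, two monitor names that differ only in punctuation or case map to the same key, so names in local.internal_monitors need to stay distinct after this normalization.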
|
||||
|
||||
resource "kubernetes_cron_job_v1" "internal_monitor_sync" {
|
||||
metadata {
|
||||
name = "internal-monitor-sync"
|
||||
namespace = kubernetes_namespace.uptime-kuma.metadata[0].name
|
||||
}
|
||||
spec {
|
||||
concurrency_policy = "Forbid"
|
||||
failed_jobs_history_limit = 3
|
||||
successful_jobs_history_limit = 3
|
||||
schedule = "*/10 * * * *"
|
||||
job_template {
|
||||
metadata {}
|
||||
spec {
|
||||
backoff_limit = 1
|
||||
ttl_seconds_after_finished = 300
|
||||
template {
|
||||
metadata {}
|
||||
spec {
|
||||
container {
|
||||
name = "sync"
|
||||
image = "docker.io/library/python:3.12-alpine"
|
||||
command = ["/bin/sh", "-c", <<-EOT
|
||||
pip install --quiet --disable-pip-version-check uptime-kuma-api
|
||||
python3 << 'PYEOF'
|
||||
import json, os, time
|
||||
from uptime_kuma_api import UptimeKumaApi, MonitorType
|
||||
|
||||
UPTIME_KUMA_URL = "http://uptime-kuma.uptime-kuma.svc.cluster.local"
|
||||
UPTIME_KUMA_PASS = os.environ["UPTIME_KUMA_PASSWORD"]
|
||||
|
||||
with open("/config/targets.json") as f:
|
||||
targets = json.load(f)
|
||||
|
||||
api = UptimeKumaApi(UPTIME_KUMA_URL, timeout=120, wait_events=0.2)
|
||||
api.login("admin", UPTIME_KUMA_PASS)
|
||||
|
||||
existing = {m["name"]: m for m in api.get_monitors()}
|
||||
|
||||
for t in targets:
|
||||
name = t["name"]
|
||||
password = os.environ[t["password_env"]]
|
||||
# MYSQL monitors use `databaseConnectionString` + `radiusPassword`
|
||||
# (UK v2 re-uses the radiusPassword field for mysql auth — backwards compat).
|
||||
desired = {
|
||||
"type": MonitorType(t["type"]),
|
||||
"name": name,
|
||||
"databaseConnectionString": t["database_connection_string"],
|
||||
"radiusPassword": password,
|
||||
"interval": t["interval"],
|
||||
"retryInterval": t["retry_interval"],
|
||||
"maxretries": t["max_retries"],
|
||||
}
|
||||
if name not in existing:
|
||||
print(f"Creating monitor: {name}")
|
||||
api.add_monitor(**desired)
|
||||
continue
|
||||
m = existing[name]
|
||||
drifted = (
|
||||
m.get("databaseConnectionString") != desired["databaseConnectionString"]
|
||||
or m.get("radiusPassword") != desired["radiusPassword"]
|
||||
or m.get("interval") != desired["interval"]
|
||||
or m.get("retryInterval") != desired["retryInterval"]
|
||||
or m.get("maxretries") != desired["maxretries"]
|
||||
)
|
||||
if drifted:
|
||||
print(f"Updating monitor {name} (id={m['id']})")
|
||||
api.edit_monitor(
|
||||
m["id"],
|
||||
databaseConnectionString=desired["databaseConnectionString"],
|
||||
radiusPassword=desired["radiusPassword"],
|
||||
interval=desired["interval"],
|
||||
retryInterval=desired["retryInterval"],
|
||||
maxretries=desired["maxretries"],
|
||||
)
|
||||
else:
|
||||
print(f"Monitor {name} (id={m['id']}) already in desired state")
|
||||
time.sleep(0.3)
|
||||
|
||||
api.disconnect()
|
||||
print("Internal monitor sync complete")
|
||||
PYEOF
|
||||
EOT
|
||||
]
|
||||
env_from {
|
||||
secret_ref {
|
||||
name = kubernetes_secret.internal_monitor_sync.metadata[0].name
|
||||
}
|
||||
}
|
||||
volume_mount {
|
||||
name = "config"
|
||||
mount_path = "/config"
|
||||
read_only = true
|
||||
}
|
||||
resources {
|
||||
requests = {
|
||||
memory = "128Mi"
|
||||
cpu = "10m"
|
||||
}
|
||||
limits = {
|
||||
memory = "256Mi"
|
||||
}
|
||||
}
|
||||
}
|
||||
volume {
|
||||
name = "config"
|
||||
config_map {
|
||||
name = kubernetes_config_map_v1.internal_monitor_targets.metadata[0].name
|
||||
}
|
||||
}
|
||||
dns_config {
|
||||
option {
|
||||
name = "ndots"
|
||||
value = "2"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].job_template[0].spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
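The drift check inside the internal monitor sync script compares five fields between the stored monitor and the desired spec. A minimal sketch of the same comparison as a standalone function follows; `drifted_fields` and `COMPARED_FIELDS` are hypothetical names, and the sample dictionaries are made up for illustration.

```python
# Field names taken from the sync script; the helper itself is hypothetical.
COMPARED_FIELDS = (
    "databaseConnectionString",
    "radiusPassword",
    "interval",
    "retryInterval",
    "maxretries",
)


def drifted_fields(existing, desired, fields=COMPARED_FIELDS):
    """Return the names of fields whose stored value differs from the desired one."""
    return [f for f in fields if existing.get(f) != desired[f]]


desired = {
    "databaseConnectionString": "mysql://uptimekuma@mysql.dbaas.svc.cluster.local:3306",
    "radiusPassword": "example-password",  # placeholder value
    "interval": 60,
    "retryInterval": 60,
    "maxretries": 2,
}
existing = dict(desired, interval=120, id=1)

print(drifted_fields(existing, desired))
print(drifted_fields(desired, desired))
```

Returning the list of differing fields, rather than a single boolean, makes the log line for an update more informative. Note that a field missing from the stored monitor reads as `None` through `existing.get` and is therefore always reported as drifted.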
|
||||
|
|
|
|||
|
|
@ -523,7 +523,7 @@ resource "vault_database_secret_backend_connection" "postgresql" {
|
|||
# "pg-trading", # Commented out 2026-04-06 - trading-bot disabled
|
||||
"pg-health", "pg-linkwarden",
|
||||
"pg-affine", "pg-woodpecker", "pg-claude-memory",
|
||||
"pg-terraform-state"
|
||||
"pg-terraform-state", "pg-payslip-ingest"
|
||||
]
|
||||
|
||||
postgresql {
|
||||
|
|
@ -661,6 +661,14 @@ resource "vault_database_secret_backend_static_role" "pg_terraform_state" {
|
|||
rotation_period = 604800
|
||||
}
|
||||
|
||||
resource "vault_database_secret_backend_static_role" "pg_payslip_ingest" {
|
||||
backend = vault_mount.database.path
|
||||
db_name = vault_database_secret_backend_connection.postgresql.name
|
||||
name = "pg-payslip-ingest"
|
||||
username = "payslip_ingest"
|
||||
rotation_period = 604800
|
||||
}
|
||||
|
||||
# =============================================================================
|
||||
# Kubernetes Secrets Engine — Dynamic K8s Credentials
|
||||
# =============================================================================
|
||||
|
|
|
|||
|
|
@ -76,7 +76,7 @@ resource "kubernetes_persistent_volume_claim" "data_proxmox" {
|
|||
|
||||
resource "kubernetes_deployment" "wealthfolio" {
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
metadata {
|
||||
name = "wealthfolio"
|
||||
|
|
|
|||
|
|
@ -230,7 +230,7 @@ resource "kubernetes_deployment" "webhook_handler" {
|
|||
}
|
||||
}
|
||||
lifecycle {
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config]
|
||||
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
|
||||
}
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -277,25 +277,3 @@ resource "kubernetes_manifest" "piper_tcp_ingressroute" {
|
|||
}
|
||||
}
|
||||
|
||||
# TCP passthrough from Traefik to ollama service (for HA voice pipeline)
|
||||
resource "kubernetes_manifest" "ollama_tcp_ingressroute" {
|
||||
manifest = {
|
||||
apiVersion = "traefik.io/v1alpha1"
|
||||
kind = "IngressRouteTCP"
|
||||
metadata = {
|
||||
name = "ollama-tcp"
|
||||
namespace = "traefik"
|
||||
}
|
||||
spec = {
|
||||
entryPoints = ["ollama-tcp"]
|
||||
routes = [{
|
||||
match = "HostSNI(`*`)"
|
||||
services = [{
|
||||
name = "ollama"
|
||||
namespace = "ollama"
|
||||
port = 11434
|
||||
}]
|
||||
}]
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
@ -33,7 +33,6 @@ resource "kubernetes_manifest" "external_secret" {
|
|||
}
|
||||
|
||||
variable "redis_host" { type = string }
|
||||
variable "ollama_host" { type = string }
|
||||
|
||||
|
||||
resource "kubernetes_namespace" "ytdlp" {
|
||||
|
|
@ -285,15 +284,6 @@ resource "kubernetes_deployment" "yt_highlights" {
|
|||
name = "TORCH_HOME"
|
||||
value = "/data/cache/torch"
|
||||
}
|
||||
# Ollama fallback for when OpenRouter models fail
|
||||
env {
|
||||
name = "OLLAMA_URL"
|
||||
value = "http://${var.ollama_host}:11434"
|
||||
}
|
||||
env {
|
||||
name = "OLLAMA_MODEL"
|
||||
value = "qwen2.5:14b"
|
||||
}
|
||||
volume_mount {
|
||||
name = "data"
|
||||
mount_path = "/data"
|
||||
|
|
|
|||
File diff suppressed because one or more lines are too long