# Configuration reference
The orchestrator reads a single YAML file at startup. Production
installs use `/etc/vetting/vetting.yaml`; the dev default is
`deploy/vetting.example.yaml`. Pass the path with `--config`:
```
vetting --config /etc/vetting/vetting.yaml
```
Every key has a compile-time default (see `internal/config/config.go`),
so an empty file produces a working orchestrator bound to
`127.0.0.1:8080` with PXE disabled.

---
## `server`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `bind` | string | `127.0.0.1:8080` | Address and port the HTTP server listens on. |
| `public_url` | string | *(empty)* | External URL at which a browser can reach the orchestrator (e.g. `https://vetting.lan:8443`). Used to build click-through links in notifications. |
| `tls.enabled` | bool | `false` | Terminate TLS at the orchestrator. |
| `tls.cert_file` | string | *(empty)* | Path to the PEM-encoded certificate. |
| `tls.key_file` | string | *(empty)* | Path to the PEM-encoded private key. |
## `database`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `path` | string | `./var/vetting.db` | SQLite database file. Created on first run. |
## `artifacts`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/artifacts` | Directory for per-run files (reports, fio logs, iperf logs, hold keys). |
| `retention_days` | int | `30` | Days to keep artifact files before the janitor prunes them. `0` = keep forever. DB rows are never pruned. |
## `logs`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/logs` | Directory for per-run append-only log files. |
| `retention_days` | int | `30` | Days to keep log files. `0` = keep forever. |
## `janitor`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `interval_minutes` | int | `60` | Minutes between cleanup sweeps. A value of `0` falls back to `60`. |
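
The retention and interval rules above can be sketched in a few lines of Go. This is a minimal illustration, not the janitor's actual code: the function names, the use of file modification time as the age reference, and the treatment of negative values like `0` are all assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// sampleNow is a fixed clock so the example prints the same result every run.
var sampleNow = time.Date(2026, 4, 23, 12, 0, 0, 0, time.UTC)

// expired reports whether a file last modified at modTime has outlived the
// retention window. retentionDays == 0 means "keep forever".
// Assumption: age is measured from the file's modification time.
func expired(modTime, now time.Time, retentionDays int) bool {
	if retentionDays <= 0 {
		return false
	}
	cutoff := now.AddDate(0, 0, -retentionDays)
	return modTime.Before(cutoff)
}

// sweepInterval applies the documented fallback: 0 behaves like 60 minutes.
func sweepInterval(intervalMinutes int) time.Duration {
	if intervalMinutes <= 0 {
		intervalMinutes = 60
	}
	return time.Duration(intervalMinutes) * time.Minute
}

func main() {
	old := sampleNow.AddDate(0, 0, -31)
	fmt.Println(expired(old, sampleNow, 30)) // true
	fmt.Println(expired(old, sampleNow, 0))  // false
	fmt.Println(sweepInterval(0))            // 1h0m0s
}
```
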
## `dispatcher`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `max_concurrent_runs` | int | `3` | Maximum number of vetting runs that execute in parallel, enforced with a semaphore. |
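
A common way to build such a limit in Go is a buffered channel used as a counting semaphore. The sketch below shows that general technique under assumed names (`runAll` is hypothetical); it is not taken from the dispatcher's source, and the guard for values below 1 is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runAll executes every job while never letting more than maxConcurrent run
// at once. It returns the highest concurrency it observed.
func runAll(jobs []func(), maxConcurrent int) int {
	if maxConcurrent < 1 {
		maxConcurrent = 1 // assumption: avoid a zero-capacity semaphore
	}
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for _, job := range jobs {
		sem <- struct{}{} // blocks while maxConcurrent jobs are in flight
		wg.Add(1)
		go func(job func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			job()

			mu.Lock()
			running--
			mu.Unlock()
		}(job)
	}
	wg.Wait()
	return peak
}

func main() {
	jobs := make([]func(), 10)
	for i := range jobs {
		jobs[i] = func() { time.Sleep(20 * time.Millisecond) }
	}
	// With max_concurrent_runs = 3 the peak never exceeds 3.
	fmt.Println("peak concurrency:", runAll(jobs, 3))
}
```
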
## `network`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `iperf_port` | int | `5201` | Port the orchestrator-supervised `iperf3 -s` binds to. The agent connects here during the Network stage. |
## `pxe`
PXE is disabled by default. Enable it after running
[`vetting-pxe-setup`](operations.md#pxe-enablement).
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | bool | `false` | Enable dnsmasq + iPXE serving. |
| `interface` | string | *(empty)* | LAN NIC the dnsmasq proxy-DHCP binds to (e.g. `eth0`). |
| `subnet` | string | *(empty)* | LAN CIDR (e.g. `192.168.1.0/24`). Scopes the proxy-DHCP responses. |
| `orchestrator_url` | string | *(empty)* | URL the live-image agent uses to reach the orchestrator (e.g. `http://192.168.1.135:8080`). Baked into the iPXE kernel cmdline. |
| `tftp_root` | string | *(empty)* | Directory containing `ipxe.efi` + `undionly.kpxe`. |
| `live_dir` | string | *(empty)* | Directory containing `vmlinuz` + `initrd.img`. Served at `/live/*`. |

dnsmasq runs in **proxy-DHCP mode**: it coexists with your existing
router's DHCP server and only supplements PXE options. See
[operations.md](operations.md#pxe-enablement) for the full setup
walkthrough.
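
Because `orchestrator_url` is baked into the kernel cmdline, a typo there only shows up once a machine has already booted. The following hypothetical pre-flight check (not part of the orchestrator, as far as this page states) confirms that an IP-literal `orchestrator_url` sits inside `subnet`, using the example values from the table above.

```go
package main

import (
	"fmt"
	"net/netip"
	"net/url"
)

// urlInSubnet reports whether the host of rawURL is an IP literal inside cidr.
// Hostnames return an error: resolving them is out of scope for this sketch.
func urlInSubnet(rawURL, cidr string) (bool, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, fmt.Errorf("subnet: %w", err)
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, fmt.Errorf("orchestrator_url: %w", err)
	}
	addr, err := netip.ParseAddr(u.Hostname())
	if err != nil {
		return false, fmt.Errorf("orchestrator_url host: %w", err)
	}
	return prefix.Contains(addr), nil
}

func main() {
	ok, err := urlInSubnet("http://192.168.1.135:8080", "192.168.1.0/24")
	fmt.Println(ok, err) // true <nil>
}
```
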
## `agent`
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `asset_dir` | string | `<database.dir>/../assets` | Directory containing `vetting-agent-linux-amd64`. Served at `/assets/*` so the quick-register one-liner can download the agent binary. Empty string disables the route. |
## `notifiers`
An array of notification targets. Each entry declares a named notifier
with a type-specific set of fields. Delivery is fire-and-forget (one
attempt per event, 10 s timeout, failures logged).
### ntfy
```yaml
notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC
```
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `ntfy` |
| `server` | string | ntfy server URL. |
| `topic` | string | Topic to publish to. |
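
To make the fire-and-forget semantics concrete, here is a minimal sketch of a single delivery attempt with a 10 s timeout. ntfy accepts a plain HTTP POST of the message body to `<server>/<topic>`; the function names and message text are illustrative, and the local test server stands in for a real ntfy instance so the example runs offline.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// topicURL joins the configured server and topic into the publish URL.
func topicURL(server, topic string) string {
	return strings.TrimRight(server, "/") + "/" + topic
}

// publish makes exactly one delivery attempt, bounded by a 10 s timeout.
func publish(ctx context.Context, client *http.Client, server, topic, message string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		topicURL(server, topic), strings.NewReader(message))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("ntfy returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in for a real ntfy server.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Printf("%s %s: %s\n", r.Method, r.URL.Path, body)
	}))
	defer srv.Close()

	// Fire-and-forget: a failure is logged, never retried.
	err := publish(context.Background(), srv.Client(), srv.URL, "vetting-YOUR-TOPIC", "run completed")
	if err != nil {
		log.Printf("notify ops-ntfy: %v", err)
	}
}
```
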
### Discord
```yaml
notifiers:
  - name: ops-discord
    type: discord
    webhook_url: https://discord.com/api/webhooks/XXX/YYY
```
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `discord` |
| `webhook_url` | string | Discord webhook URL. |
### SMTP
```yaml
notifiers:
  - name: ops-email
    type: smtp
    smtp:
      host: mail.lan
      port: 25
      from: vetting@lan.local
      to: [ops@lan.local]
```
| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `smtp` |
| `smtp.host` | string | SMTP server hostname. |
| `smtp.port` | int | SMTP server port. |
| `smtp.from` | string | Sender address. |
| `smtp.to` | string[] | Recipient addresses. |
## `routes`
Routes map notification events to notifiers by kind and severity.
Each route is evaluated independently; an event can match multiple
routes and fire on multiple notifiers.
```yaml
routes:
  - match_severity: [critical]
    notifier: ops-ntfy
  - match_severity: [critical]
    notifier: ops-discord
  - match_kind: [RunCompleted]
    notifier: ops-ntfy
```
| Field | Type | Description |
|-------|------|-------------|
| `match_kind` | string[] | Event kinds to match: `StageFailed`, `SpecMismatch`, `HoldingOpened`, `RunCompleted`. Omit to match all kinds. |
| `match_severity` | string[] | Severities to match: `critical`, `warning`, `info`. Omit to match all severities. |
| `notifier` | string | Name of a declared notifier to deliver to. |
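
The matching rules can be sketched as follows. The type and function names are hypothetical, and two behaviors are assumptions this page does not state: a route that sets both `match_kind` and `match_severity` requires both to match, and a notifier named by two matching routes is listed twice (no de-duplication).

```go
package main

import "fmt"

// Event and Route mirror the fields described in the table above.
type Event struct{ Kind, Severity string }

type Route struct {
	MatchKind     []string
	MatchSeverity []string
	Notifier      string
}

// contains treats an empty list as a wildcard ("omit to match all").
func contains(list []string, v string) bool {
	if len(list) == 0 {
		return true
	}
	for _, s := range list {
		if s == v {
			return true
		}
	}
	return false
}

// targets evaluates every route independently and returns the notifier of
// each matching route, in route order.
func targets(routes []Route, ev Event) []string {
	var out []string
	for _, r := range routes {
		if contains(r.MatchKind, ev.Kind) && contains(r.MatchSeverity, ev.Severity) {
			out = append(out, r.Notifier)
		}
	}
	return out
}

func main() {
	routes := []Route{
		{MatchSeverity: []string{"critical"}, Notifier: "ops-ntfy"},
		{MatchSeverity: []string{"critical"}, Notifier: "ops-discord"},
		{MatchKind: []string{"RunCompleted"}, Notifier: "ops-ntfy"},
	}
	fmt.Println(targets(routes, Event{"StageFailed", "critical"})) // [ops-ntfy ops-discord]
	fmt.Println(targets(routes, Event{"RunCompleted", "info"}))    // [ops-ntfy]
	fmt.Println(targets(routes, Event{"SpecMismatch", "warning"})) // []
}
```
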
## `vetting`
Shared pipeline defaults that apply to all profiles.
### `vetting.stages`
Ordered list of stage names the pipeline walks. Default:
```yaml
vetting:
  stages:
    - Inventory
    - Firmware
    - SpecValidate
    - SMART
    - CPUStress
    - Storage
    - Network
    - Burn
    - GPU
    - PSU
    - Reporting
```
### `vetting.thresholds`
Array of threshold rules evaluated against every `/sensor` batch.
Rules apply across all profiles: a 92 C CPU limit fails a short quick
run and a multi-hour soak alike.
| Field | Type | Description |
|-------|------|-------------|
| `stage` | string | Stage selector. `*` matches any stage; exact name (e.g. `PSU`) limits to that stage. |
| `kind` | string | Measurement kind to match: `temp`, `psu_volt`, `iperf`, `fio_p99_us`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`, `smart_attr`, `fan`. |
| `key` | string | Key selector. Glob-ish matching: `*` matches all, `cpu/*` matches keys starting with `cpu/`, exact string for specific keys. |
| `op` | string | Comparison operator (see table below). |
| `value` | float | Threshold limit. |
| `nominal` | float | Reference value, only used by `within_pct` (e.g. `12.0` for a +12 V rail). |
| `unit` | string | Display unit (e.g. `C`, `V`, `Mbps`). Informational only. |
| `severity` | string | `critical` = fail the run immediately. `warning` = record for the report only. |

**Threshold operators:**

| Operator | Pass condition | Typical use |
|----------|---------------|-------------|
| `lt` | `observed < value` | CPU temp < 92 C |
| `lte` | `observed <= value` | EDAC UE count <= 0 |
| `gt` | `observed > value` | — |
| `gte` | `observed >= value` | iperf throughput >= 900 Mbps |
| `within_pct` | `abs(observed - nominal) / nominal * 100 <= value` | +12 V rail within 5 % of 12.0 V |

**Default thresholds** (from `deploy/vetting.example.yaml`):

```yaml
thresholds:
- { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 92, unit: C, severity: critical }
- { stage: PSU, kind: psu_volt, key: "+12V", op: within_pct, value: 5, nominal: 12.0, severity: critical }
- { stage: PSU, kind: psu_volt, key: "+5V", op: within_pct, value: 5, nominal: 5.0, severity: critical }
- { stage: PSU, kind: psu_volt, key: "+3.3V", op: within_pct, value: 5, nominal: 3.3, severity: critical }
- { stage: Storage, kind: fio_p99_us, key: "*", op: lt, value: 50000, severity: warning }
- { stage: Network, kind: iperf, key: throughput_mbps, op: gte, value: 900, severity: critical }
- { stage: Network, kind: nic_retrans, key: "*/rate", op: lt, value: 0.001, severity: warning }
- { stage: CPUStress, kind: edac_ue, key: "*", op: lte, value: 0, severity: critical }
- { stage: CPUStress, kind: mce, key: "*", op: lte, value: 0, severity: critical }
```
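
As a worked example of how a rule selects a measurement and then judges it, the sketch below implements the stage selector, key selector, and the five operators from the tables above. It is an illustration under assumptions, not the threshold engine's code: the `Rule` type and method names are hypothetical, and the suffix form of the key selector (a leading `*`) is inferred from the default `*/rate` rule rather than documented.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

type Rule struct {
	Stage, Kind, Key, Op string
	Value, Nominal       float64
}

// keyMatches handles "*", a trailing-"*" prefix match, and exact strings.
// Assumption: a leading "*" (as in "*/rate") is a suffix match.
func keyMatches(pattern, key string) bool {
	switch {
	case pattern == "*":
		return true
	case strings.HasSuffix(pattern, "*"):
		return strings.HasPrefix(key, strings.TrimSuffix(pattern, "*"))
	case strings.HasPrefix(pattern, "*"):
		return strings.HasSuffix(key, strings.TrimPrefix(pattern, "*"))
	default:
		return pattern == key
	}
}

// applies reports whether the rule selects a measurement.
func (r Rule) applies(stage, kind, key string) bool {
	return (r.Stage == "*" || r.Stage == stage) && r.Kind == kind && keyMatches(r.Key, key)
}

// passes evaluates the rule's pass condition for an observed value.
func (r Rule) passes(observed float64) (bool, error) {
	switch r.Op {
	case "lt":
		return observed < r.Value, nil
	case "lte":
		return observed <= r.Value, nil
	case "gt":
		return observed > r.Value, nil
	case "gte":
		return observed >= r.Value, nil
	case "within_pct":
		return math.Abs(observed-r.Nominal)/r.Nominal*100 <= r.Value, nil
	}
	return false, fmt.Errorf("unknown op %q", r.Op)
}

func main() {
	rail := Rule{Stage: "PSU", Kind: "psu_volt", Key: "+12V", Op: "within_pct", Value: 5, Nominal: 12.0}
	fmt.Println(rail.applies("PSU", "psu_volt", "+12V")) // true
	fmt.Println(rail.passes(11.5))                       // true <nil>  (about 4.2 % off nominal)
	fmt.Println(rail.passes(11.0))                       // false <nil> (about 8.3 % off nominal)

	cpu := Rule{Stage: "*", Kind: "temp", Key: "cpu/*", Op: "lt", Value: 92}
	fmt.Println(cpu.applies("Burn", "temp", "cpu/package0")) // true
	fmt.Println(cpu.passes(95))                              // false <nil>
}
```

With a 5 % tolerance on a 12.0 V nominal, the passing band is roughly 11.4 V to 12.6 V; values exactly on the boundary are subject to floating-point rounding.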
## `profiles`
Three built-in profiles control per-stage durations and probe knobs.
Every profile exercises every probe and gate; only the durations
scale. `quick` is a ~10-minute same-day sanity check, `deep` is the
8-12 hour overnight run, and `soak` is the opt-in 36-40 hour extreme run.
### Profile inheritance
A profile can declare `inherit: <parent>` to merge the parent's
timeouts and defaults before applying its own overrides. Child keys
win. The default `soak` profile inherits from `deep`.
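
The merge order can be sketched as "resolve the parent first, then lay the child's keys on top". The example below applies that to the `cpustress` knobs of `deep` and `soak`; the `Profile` type, the `resolve` function, and the cycle check are assumptions made for illustration, not the orchestrator's implementation.

```go
package main

import "fmt"

// Profile holds only the cpustress knobs to keep the sketch short.
type Profile struct {
	Inherit   string
	CPUStress map[string]string
}

// resolve returns the effective knobs for the named profile.
func resolve(profiles map[string]Profile, name string) (map[string]string, error) {
	return resolveSeen(profiles, name, map[string]bool{})
}

func resolveSeen(profiles map[string]Profile, name string, seen map[string]bool) (map[string]string, error) {
	p, ok := profiles[name]
	if !ok {
		return nil, fmt.Errorf("unknown profile %q", name)
	}
	if seen[name] {
		return nil, fmt.Errorf("inheritance cycle at %q", name)
	}
	seen[name] = true

	out := map[string]string{}
	if p.Inherit != "" {
		parent, err := resolveSeen(profiles, p.Inherit, seen)
		if err != nil {
			return nil, err
		}
		for k, v := range parent {
			out[k] = v
		}
	}
	for k, v := range p.CPUStress {
		out[k] = v // child keys win
	}
	return out, nil
}

func main() {
	profiles := map[string]Profile{
		"deep": {CPUStress: map[string]string{"cpu_pass": "60m", "mem_pass": "60m", "edac_poll": "10s"}},
		"soak": {Inherit: "deep", CPUStress: map[string]string{"cpu_pass": "12h"}},
	}
	fmt.Println(resolve(profiles, "soak")) // map[cpu_pass:12h edac_poll:10s mem_pass:60m] <nil>
}
```
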
### `stage_timeouts`
Per-stage time limits. The orchestrator kills the agent's stage
subprocess when a timeout fires.
| Stage | quick | deep | soak |
|-------|-------|------|------|
| CPUStress | 5 m | 2 h | 14 h |
| Storage | 5 m | 4 h | 8 h |
| Network | 2 m | 35 m | 2 h 30 m |
| Burn | 3 m | 3 h | 20 h |
| PSU | 1 m | 10 m | 15 m |
### `defaults`
Per-stage probe knobs shipped to the agent on `/claim`. Empty values
mean "fall back to the agent's compile-time default".
#### `cpustress`
| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `cpu_pass` | duration | `stress-ng --cpu` duration | 2 m | 60 m | 12 h |
| `mem_pass` | duration | `stress-ng --vm` duration | 2 m | 60 m | *(inherit)* |
| `edac_poll` | duration | EDAC error counter polling interval | 10 s | 10 s | *(inherit)* |
#### `storage`
| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `mode` | string | `fio_sample` (skip badblocks) or `full_disk` (badblocks + fio) | fio_sample | full_disk | full_disk |
| `fio_size` | string | fio test file size (only in `fio_sample` mode) | 1 GiB | *(inherit)* | *(inherit)* |
| `fio_time` | duration | fio runtime | 3 m | 2 h | 6 h |
| `fio_bs` | string | fio block size | 4 k | 4 k | *(inherit)* |
| `fio_rw` | string | fio I/O pattern | randrw | randrw | *(inherit)* |
| `verify` | string | fio integrity mode (`md5` or empty) | md5 | md5 | *(inherit)* |
#### `network`
| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | `iperf3` test duration | 60 s | 30 m | 2 h |
#### `burn`
| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | Total burn-in window (CPU + mem + disk + net simultaneously) | 2 m | 2 h | 18 h |
| `cpu_workers` | string | `all` (= `runtime.NumCPU()`) or a numeric string | all | all | *(inherit)* |
| `mem_pct` | int | Percentage of MemAvailable to stress | 50 | 70 | *(inherit)* |
| `fio_on_spare` | bool | Run fio inside Burn (requires a spare partition) | true | true | *(inherit)* |
| `iperf_parallel` | int | Parallel stream count fed to `iperf3 -P` | 2 | 4 | 8 |
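
The sketch below shows one way an agent could turn these knobs into concrete values: empty strings fall back to a built-in default, durations such as `2m` or `2h30m` parse with Go's `time.ParseDuration`, and `cpu_workers` accepts `all` or a numeric string. The function names and the choice of `time.ParseDuration` are assumptions; the agent's real parsing code is not shown on this page.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"time"
)

// durationOr parses a duration knob; an empty value selects the fallback.
func durationOr(raw string, fallback time.Duration) (time.Duration, error) {
	if raw == "" {
		return fallback, nil
	}
	return time.ParseDuration(raw)
}

// cpuWorkers resolves the cpu_workers knob. "all" maps to runtime.NumCPU(),
// an empty value selects the fallback, anything else must be a positive
// integer written as a string.
func cpuWorkers(raw string, fallback int) (int, error) {
	switch raw {
	case "":
		return fallback, nil
	case "all":
		return runtime.NumCPU(), nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("cpu_workers: want \"all\" or a positive integer, got %q", raw)
	}
	return n, nil
}

func main() {
	d, _ := durationOr("2h30m", time.Minute)
	fmt.Println(d) // 2h30m0s
	d, _ = durationOr("", time.Minute)
	fmt.Println(d) // 1m0s

	n, _ := cpuWorkers("4", 1)
	fmt.Println(n) // 4
	n, _ = cpuWorkers("all", 1)
	fmt.Println(n == runtime.NumCPU()) // true
}
```
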
### Example profile block
```yaml
profiles:
  quick:
    stage_timeouts:
      CPUStress: 5m
      Storage: 5m
      Network: 2m
    defaults:
      cpustress: { cpu_pass: 2m, mem_pass: 2m, edac_poll: 10s }
      storage: { mode: fio_sample, fio_size: 1GiB, fio_time: 3m, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network: { duration: 60s }
      burn: { duration: 2m, cpu_workers: all, mem_pct: 50, fio_on_spare: true, iperf_parallel: 2 }
  deep:
    stage_timeouts:
      CPUStress: 2h
      Storage: 4h
      Network: 35m
    defaults:
      cpustress: { cpu_pass: 60m, mem_pass: 60m, edac_poll: 10s }
      storage: { mode: full_disk, fio_time: 2h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network: { duration: 30m }
      burn: { duration: 2h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 4 }
  soak:
    inherit: deep
    stage_timeouts:
      CPUStress: 14h
      Storage: 8h
      Network: 2h30m
    defaults:
      cpustress: { cpu_pass: 12h }
      storage: { mode: full_disk, fio_time: 6h }
      network: { duration: 2h }
      burn: { duration: 18h, iperf_parallel: 8 }
```
---
## Host-mode agent config
The persistent host-mode agent reads a separate file,
`/etc/vetting/host-agent.yaml`. The quick-register one-liner installs
this file, and it is independent of the orchestrator config.
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `orchestrator_url` | string | *(required)* | URL of the orchestrator (e.g. `http://192.168.1.135:8080`). |
| `mac` | string | *(auto-detected)* | MAC address the agent identifies itself with in heartbeats. If omitted, it is auto-detected from the NIC that carries the default route. |
| `interval` | duration | `30s` | Heartbeat interval. |
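
A heartbeat loop driven by `interval` can be sketched with a ticker. This is a generic illustration: the `heartbeat` function, its stop channel, the immediate first beat, and the fallback to `30s` for non-positive intervals are assumptions, and the actual heartbeat request (endpoint and payload are not described on this page) is abstracted behind the `send` callback.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeat calls send immediately and then once per interval until stop is
// closed. A failed send is counted and the loop keeps going.
func heartbeat(interval time.Duration, stop <-chan struct{}, send func() error) (sent, failed int) {
	if interval <= 0 {
		interval = 30 * time.Second // documented default
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Check for shutdown before every beat so no send happens after stop.
		select {
		case <-stop:
			return sent, failed
		default:
		}
		if err := send(); err != nil {
			failed++
		} else {
			sent++
		}
		select {
		case <-stop:
			return sent, failed
		case <-ticker.C:
		}
	}
}

func main() {
	stop := make(chan struct{})
	n := 0
	sent, failed := heartbeat(10*time.Millisecond, stop, func() error {
		n++
		fmt.Println("heartbeat", n)
		if n == 3 {
			close(stop) // stop after three beats for the demo
		}
		return nil
	})
	fmt.Println(sent, failed) // 3 0
}
```
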