# Configuration reference

The orchestrator reads a single YAML file at startup. Production installs use `/etc/vetting/vetting.yaml`; the dev default is `deploy/vetting.example.yaml`. Pass the path with `--config`:

```
vetting --config /etc/vetting/vetting.yaml
```

Every key has a compile-time default (see `internal/config/config.go`), so an empty file produces a working orchestrator bound to `127.0.0.1:8080` with PXE disabled.

---

## `server`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `bind` | string | `127.0.0.1:8080` | Address and port the HTTP server listens on. |
| `public_url` | string | *(empty)* | External URL the orchestrator is reachable at from a browser. Used in notification click-throughs (e.g. `https://vetting.lan:8443`). |
| `tls.enabled` | bool | `false` | Terminate TLS at the orchestrator. |
| `tls.cert_file` | string | *(empty)* | Path to the PEM-encoded certificate. |
| `tls.key_file` | string | *(empty)* | Path to the PEM-encoded private key. |

## `database`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `path` | string | `./var/vetting.db` | SQLite database file. Created on first run. |

## `artifacts`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/artifacts` | Directory for per-run files (reports, fio logs, iperf logs, hold keys). |
| `retention_days` | int | `30` | Days to keep artifact files before the janitor prunes them. `0` = keep forever. DB rows are never pruned. |

## `logs`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/logs` | Directory for per-run append-only log files. |
| `retention_days` | int | `30` | Days to keep log files. `0` = keep forever. |

## `janitor`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `interval_minutes` | int | `60` | Minutes between cleanup sweeps. `0` defaults to `60`. |

## `dispatcher`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `max_concurrent_runs` | int | `3` | Semaphore limiting how many vetting runs execute in parallel. |

## `network`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `iperf_port` | int | `5201` | Port the orchestrator-supervised `iperf3 -s` binds to. The agent connects here during the Network stage. |

## `pxe`

PXE is disabled by default. Enable it after running [`vetting-pxe-setup`](operations.md#pxe-enablement).

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | bool | `false` | Enable dnsmasq + iPXE serving. |
| `interface` | string | *(empty)* | LAN NIC the dnsmasq proxy-DHCP binds to (e.g. `eth0`). |
| `subnet` | string | *(empty)* | LAN CIDR (e.g. `192.168.1.0/24`). Scopes the proxy-DHCP responses. |
| `orchestrator_url` | string | *(empty)* | URL the live-image agent uses to reach the orchestrator (e.g. `http://192.168.1.135:8080`). Baked into the iPXE kernel cmdline. |
| `tftp_root` | string | *(empty)* | Directory containing `ipxe.efi` + `undionly.kpxe`. |
| `live_dir` | string | *(empty)* | Directory containing `vmlinuz` + `initrd.img`. Served at `/live/*`. |

dnsmasq runs in **proxy-DHCP mode**: it coexists with your existing router's DHCP server and only supplements PXE options. See [operations.md](operations.md#pxe-enablement) for the full setup walkthrough.

## `agent`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `asset_dir` | string | `/../assets` | Directory containing `vetting-agent-linux-amd64`. Served at `/assets/*` so the quick-register one-liner can download the agent binary. Empty string disables the route. |

## `notifiers`

An array of notification targets. Each entry declares a named notifier with a type-specific set of fields. Delivery is fire-and-forget (one attempt per event, 10 s timeout, failures logged).

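Because `notifiers` is a single array, several targets are presumably declared as sibling entries under one `notifiers:` key rather than as repeated top-level keys. The sketch below combines two of the per-type examples from the subsections that follow, reusing their placeholder names, topic, and webhook URL unchanged:

```yaml
# Two notifiers declared in the same array (placeholder values)
notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC
  - name: ops-discord
    type: discord
    webhook_url: https://discord.com/api/webhooks/XXX/YYY
```

Each `name` is what a route's `notifier` field refers to.
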
### ntfy

```yaml
notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `ntfy` |
| `server` | string | ntfy server URL. |
| `topic` | string | Topic to publish to. |

### Discord

```yaml
notifiers:
  - name: ops-discord
    type: discord
    webhook_url: https://discord.com/api/webhooks/XXX/YYY
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `discord` |
| `webhook_url` | string | Discord webhook URL. |

### SMTP

```yaml
notifiers:
  - name: ops-email
    type: smtp
    smtp:
      host: mail.lan
      port: 25
      from: vetting@lan.local
      to: [ops@lan.local]
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `smtp` |
| `smtp.host` | string | SMTP server hostname. |
| `smtp.port` | int | SMTP server port. |
| `smtp.from` | string | Sender address. |
| `smtp.to` | string[] | Recipient addresses. |

## `routes`

Routes map notification events to notifiers by kind and severity. Each route is evaluated independently; an event can match multiple routes and fire on multiple notifiers.

```yaml
routes:
  - match_severity: [critical]
    notifier: ops-ntfy
  - match_severity: [critical]
    notifier: ops-discord
  - match_kind: [RunCompleted]
    notifier: ops-ntfy
```

| Field | Type | Description |
|-------|------|-------------|
| `match_kind` | string[] | Event kinds to match: `StageFailed`, `SpecMismatch`, `HoldingOpened`, `RunCompleted`. Omit to match all kinds. |
| `match_severity` | string[] | Severities to match: `critical`, `warning`, `info`. Omit to match all severities. |
| `notifier` | string | Name of a declared notifier to deliver to. |

## `vetting`

Shared pipeline defaults that apply to all profiles.

### `vetting.stages`

Ordered list of stage names the pipeline walks. Default:

```yaml
vetting:
  stages:
    - Inventory
    - Firmware
    - SpecValidate
    - SMART
    - CPUStress
    - Storage
    - Network
    - Burn
    - GPU
    - PSU
    - Reporting
```

### `vetting.thresholds`

Array of threshold rules evaluated against every `/sensor` batch. Rules apply across all profiles: a 92 C CPU limit fails both a 2-minute quick run and a 12-hour soak.

| Field | Type | Description |
|-------|------|-------------|
| `stage` | string | Stage selector. `*` matches any stage; an exact name (e.g. `PSU`) limits the rule to that stage. |
| `kind` | string | Measurement kind to match: `temp`, `psu_volt`, `iperf`, `fio_p99_us`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`, `smart_attr`, `fan`. |
| `key` | string | Key selector. Glob-ish matching: `*` matches all, `cpu/*` matches keys starting with `cpu/`, exact string for specific keys. |
| `op` | string | Comparison operator (see table below). |
| `value` | float | Threshold limit. |
| `nominal` | float | Reference value, only used by `within_pct` (e.g. `12.0` for a +12 V rail). |
| `unit` | string | Display unit (e.g. `C`, `V`, `Mbps`). Informational only. |
| `severity` | string | `critical` = fail the run immediately. `warning` = record for the report only. |

**Threshold operators:**

| Operator | Pass condition | Typical use |
|----------|---------------|-------------|
| `lt` | `observed < value` | CPU temp < 92 C |
| `lte` | `observed <= value` | EDAC UE count <= 0 |
| `gt` | `observed > value` | - |
| `gte` | `observed >= value` | iperf throughput >= 900 Mbps |
| `within_pct` | `abs(observed - nominal) / nominal * 100 <= value` | +12 V rail within 5 % of 12.0 V |

**Default thresholds** (from `deploy/vetting.example.yaml`):

```yaml
thresholds:
  - { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 92, unit: C, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+12V", op: within_pct, value: 5, nominal: 12.0, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+5V", op: within_pct, value: 5, nominal: 5.0, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+3.3V", op: within_pct, value: 5, nominal: 3.3, severity: critical }
  - { stage: Storage, kind: fio_p99_us, key: "*", op: lt, value: 50000, severity: warning }
  - { stage: Network, kind: iperf, key: throughput_mbps, op: gte, value: 900, severity: critical }
  - { stage: Network, kind: nic_retrans, key: "*/rate", op: lt, value: 0.001, severity: warning }
  - { stage: CPUStress, kind: edac_ue, key: "*", op: lte, value: 0, severity: critical }
  - { stage: CPUStress, kind: mce, key: "*", op: lte, value: 0, severity: critical }
```

## `profiles`

Three built-in profiles control per-stage durations and probe knobs. Every profile exercises every probe and gate; only the durations scale. Quick is a ~10-minute same-day sanity check; deep is the 8-12 hour overnight soak; soak is the opt-in 36-40 hour extreme run.

### Profile inheritance

A profile can declare `inherit:` with a parent profile name (e.g. `inherit: deep`) to merge the parent's timeouts and defaults before applying its own overrides. Child keys win. The default `soak` profile inherits from `deep`.

### `stage_timeouts`

Per-stage time limits. The orchestrator kills the agent's stage subprocess when a timeout fires.

| Stage | quick | deep | soak |
|-------|-------|------|------|
| CPUStress | 5 m | 2 h | 14 h |
| Storage | 5 m | 4 h | 8 h |
| Network | 2 m | 35 m | 2 h 30 m |
| Burn | 3 m | 3 h | 20 h |
| PSU | 1 m | 10 m | 15 m |

### `defaults`

Per-stage probe knobs shipped to the agent on `/claim`. Empty values mean "fall back to the agent's compile-time default".

#### `cpustress`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `cpu_pass` | duration | `stress-ng --cpu` duration | 2 m | 60 m | 12 h |
| `mem_pass` | duration | `stress-ng --vm` duration | 2 m | 60 m | *(inherit)* |
| `edac_poll` | duration | EDAC error counter polling interval | 10 s | 10 s | *(inherit)* |

#### `storage`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `mode` | string | `fio_sample` (skip badblocks) or `full_disk` (badblocks + fio) | fio_sample | full_disk | full_disk |
| `fio_size` | string | fio test file size (only in `fio_sample` mode) | 1 GiB | *(inherit)* | *(inherit)* |
| `fio_time` | duration | fio runtime | 3 m | 2 h | 6 h |
| `fio_bs` | string | fio block size | 4 k | 4 k | *(inherit)* |
| `fio_rw` | string | fio I/O pattern | randrw | randrw | *(inherit)* |
| `verify` | string | fio integrity mode (`md5` or empty) | md5 | md5 | *(inherit)* |

#### `network`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | `iperf3` test duration | 60 s | 30 m | 2 h |

#### `burn`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | Total burn-in window (CPU + mem + disk + net simultaneously) | 2 m | 2 h | 18 h |
| `cpu_workers` | string | `all` (= `runtime.NumCPU()`) or a numeric string | all | all | *(inherit)* |
| `mem_pct` | int | Percentage of MemAvailable to stress | 50 | 70 | *(inherit)* |
| `fio_on_spare` | bool | Run fio inside Burn (requires a spare partition) | true | true | *(inherit)* |
| `iperf_parallel` | int | Parallel stream count fed to `iperf3 -P` | 2 | 4 | 8 |

### Example profile block

```yaml
profiles:
  quick:
    stage_timeouts:
      CPUStress: 5m
      Storage: 5m
      Network: 2m
    defaults:
      cpustress: { cpu_pass: 2m, mem_pass: 2m, edac_poll: 10s }
      storage: { mode: fio_sample, fio_size: 1GiB, fio_time: 3m, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network: { duration: 60s }
      burn: { duration: 2m, cpu_workers: all, mem_pct: 50, fio_on_spare: true, iperf_parallel: 2 }
  deep:
    stage_timeouts:
      CPUStress: 2h
      Storage: 4h
      Network: 35m
    defaults:
      cpustress: { cpu_pass: 60m, mem_pass: 60m, edac_poll: 10s }
      storage: { mode: full_disk, fio_time: 2h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network: { duration: 30m }
      burn: { duration: 2h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 4 }
  soak:
    inherit: deep
    stage_timeouts:
      CPUStress: 14h
      Storage: 8h
      Network: 2h30m
    defaults:
      cpustress: { cpu_pass: 12h }
      storage: { mode: full_disk, fio_time: 6h }
      network: { duration: 2h }
      burn: { duration: 18h, iperf_parallel: 8 }
```

---

## Host-mode agent config

The persistent host-mode agent reads a separate file at `/etc/vetting/host-agent.yaml`. This is installed by the quick-register one-liner and is distinct from the orchestrator config.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `orchestrator_url` | string | *(required)* | URL of the orchestrator (e.g. `http://192.168.1.135:8080`). |
| `mac` | string | *(auto-detected)* | MAC address to heartbeat as. Auto-detected from the default route NIC if omitted. |
| `interval` | duration | `30s` | Heartbeat interval. |
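
For reference, a minimal `/etc/vetting/host-agent.yaml` might look like the sketch below. It uses only the keys from the table above. The URL is the table's own example value and the MAC address is a made-up placeholder. What the quick-register one-liner actually writes into this file is not documented here, so treat `mac` and `interval` as optional overrides.

```yaml
# Sketch of /etc/vetting/host-agent.yaml (illustrative values only)
orchestrator_url: http://192.168.1.135:8080   # required
# mac: "aa:bb:cc:dd:ee:ff"    # placeholder; omit to auto-detect from the default route NIC
interval: 30s                 # heartbeat interval; 30s is already the default
```

Leaving `mac` commented out keeps the auto-detection behavior described in the table.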