Configuration reference
The orchestrator reads a single YAML file at startup. Production
installs use /etc/vetting/vetting.yaml; the dev default is
deploy/vetting.example.yaml. Pass the path with --config:
Every key has a compile-time default (see internal/config/config.go),
so an empty file produces a working orchestrator bound to
127.0.0.1:8080 with PXE disabled.
server

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `bind` | string | `127.0.0.1:8080` | Address and port the HTTP server listens on. |
| `public_url` | string | (empty) | External URL the orchestrator is reachable at from a browser. Used in notification click-throughs (e.g. `https://vetting.lan:8443`). |
| `tls.enabled` | bool | `false` | Terminate TLS at the orchestrator. |
| `tls.cert_file` | string | (empty) | Path to the PEM-encoded certificate. |
| `tls.key_file` | string | (empty) | Path to the PEM-encoded private key. |

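
A minimal sketch of a TLS-enabled server section. It assumes the dotted keys in the table (tls.enabled and so on) correspond to a nested `tls:` mapping under a top-level `server:` key; that nesting is inferred from the key names, not confirmed here. The certificate and key paths are hypothetical.

```yaml
# Sketch only: nesting inferred from the dotted key names in the table.
server:
  bind: "0.0.0.0:8443"                     # assumed example: listen on all interfaces
  public_url: "https://vetting.lan:8443"   # used in notification click-throughs
  tls:
    enabled: true
    cert_file: /etc/vetting/tls/cert.pem   # hypothetical path
    key_file: /etc/vetting/tls/key.pem     # hypothetical path
```

Keeping public_url in step with bind and the TLS setting should help notification links resolve from a browser on the LAN.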
database

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | string | `./var/vetting.db` | SQLite database file. Created on first run. |

artifacts

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dir` | string | `./var/artifacts` | Directory for per-run files (reports, fio logs, iperf logs, hold keys). |
| `retention_days` | int | `30` | Days to keep artifact files before the janitor prunes them. 0 = keep forever. DB rows are never pruned. |

logs

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dir` | string | `./var/logs` | Directory for per-run append-only log files. |
| `retention_days` | int | `30` | Days to keep log files. 0 = keep forever. |

janitor

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `interval_minutes` | int | `60` | Minutes between cleanup sweeps. 0 defaults to 60. |

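
The storage and retention keys above might be combined as in the sketch below. It assumes each section name is a top-level YAML key; the /var/lib/vetting paths and the retention values are hypothetical choices, not shipped defaults.

```yaml
# Sketch only: absolute paths and retention values are illustrative.
database:
  path: /var/lib/vetting/vetting.db
artifacts:
  dir: /var/lib/vetting/artifacts
  retention_days: 90     # prune artifact files after 90 days; DB rows stay
logs:
  dir: /var/lib/vetting/logs
  retention_days: 0      # 0 = keep log files forever
janitor:
  interval_minutes: 30   # sweep twice an hour instead of the default 60
```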
dispatcher

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `max_concurrent_runs` | int | `3` | Semaphore limiting how many vetting runs execute in parallel. |

network

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `iperf_port` | int | `5201` | Port the orchestrator-supervised `iperf3 -s` binds to. The agent connects here during the Network stage. |

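
As a small illustration, a host that can only vet one machine at a time and already has something listening on 5201 might use the following. Both values are assumptions chosen for the example, and the top-level placement of the two sections is inferred from the headings.

```yaml
# Sketch only: values are illustrative.
dispatcher:
  max_concurrent_runs: 1   # serialize runs
network:
  iperf_port: 5202         # move the supervised iperf3 server off the default port
```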
pxe
PXE is disabled by default. Enable it after running
vetting-pxe-setup.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | bool | `false` | Enable dnsmasq + iPXE serving. |
| `interface` | string | (empty) | LAN NIC the dnsmasq proxy-DHCP binds to (e.g. `eth0`). |
| `subnet` | string | (empty) | LAN CIDR (e.g. `192.168.1.0/24`). Scopes the proxy-DHCP responses. |
| `orchestrator_url` | string | (empty) | URL the live-image agent uses to reach the orchestrator (e.g. `http://192.168.1.135:8080`). Baked into the iPXE kernel cmdline. |
| `tftp_root` | string | (empty) | Directory containing `ipxe.efi` + `undionly.kpxe`. |
| `live_dir` | string | (empty) | Directory containing `vmlinuz` + `initrd.img`. Served at `/live/*`. |

dnsmasq runs in proxy-DHCP mode: it coexists with your existing
router's DHCP server and only supplements PXE options. See
operations.md for the full setup
walkthrough.
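
A sketch of an enabled pxe section, reusing the example values from the table. The two directory paths are hypothetical, and the top-level `pxe:` key is assumed from the section name.

```yaml
# Sketch only: directory paths are hypothetical.
pxe:
  enabled: true
  interface: eth0                                # LAN NIC for proxy-DHCP
  subnet: 192.168.1.0/24                         # scopes the proxy-DHCP responses
  orchestrator_url: "http://192.168.1.135:8080"  # baked into the iPXE kernel cmdline
  tftp_root: /srv/vetting/tftp                   # must contain ipxe.efi and undionly.kpxe
  live_dir: /srv/vetting/live                    # must contain vmlinuz and initrd.img
```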
agent

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `asset_dir` | string | `<database.dir>/../assets` | Directory containing `vetting-agent-linux-amd64`. Served at `/assets/*` so the quick-register one-liner can download the agent binary. Empty string disables the route. |

notifiers
An array of notification targets. Each entry declares a named notifier
with a type-specific set of fields. Delivery is fire-and-forget (one
attempt per event, 10 s timeout, failures logged).
ntfy

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `ntfy` |
| `server` | string | ntfy server URL. |
| `topic` | string | Topic to publish to. |

Discord

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `discord` |
| `webhook_url` | string | Discord webhook URL. |

SMTP

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `smtp` |
| `smtp.host` | string | SMTP server hostname. |
| `smtp.port` | int | SMTP server port. |
| `smtp.from` | string | Sender address. |
| `smtp.to` | string[] | Recipient addresses. |

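
A sketch of a notifiers array with one entry of each type. The notifier names, hostnames, and addresses are hypothetical, the webhook URL is a placeholder, and nesting the SMTP fields under an `smtp:` mapping is an assumption drawn from the dotted field names.

```yaml
# Sketch only: names, hosts and addresses are hypothetical.
notifiers:
  - name: phone                 # referenced from routes[].notifier
    type: ntfy
    server: "https://ntfy.example.com"
    topic: vetting
  - name: chat
    type: discord
    webhook_url: "https://discord.com/api/webhooks/ID/TOKEN"   # placeholder
  - name: mail
    type: smtp
    smtp:                       # assumed nesting for smtp.host, smtp.port, ...
      host: smtp.example.com
      port: 587
      from: "vetting@example.com"
      to:
        - "ops@example.com"
```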
routes
Routes map notification events to notifiers by kind and severity.
Each route is evaluated independently; an event can match multiple
routes and fire on multiple notifiers.
| Field | Type | Description |
| --- | --- | --- |
| `match_kind` | string[] | Event kinds to match: `StageFailed`, `SpecMismatch`, `HoldingOpened`, `RunCompleted`. Omit to match all kinds. |
| `match_severity` | string[] | Severities to match: `critical`, `warning`, `info`. Omit to match all severities. |
| `notifier` | string | Name of a declared notifier to deliver to. |

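
A sketch of three routes. The notifier names (phone, mail, chat) are hypothetical and would have to match entries declared under notifiers. The sketch also assumes a route fires only when both its kind and its severity filters match, which is the natural reading of the text above but is not spelled out there.

```yaml
# Sketch only: notifier names are hypothetical.
routes:
  - match_kind: [StageFailed, SpecMismatch]
    match_severity: [critical]
    notifier: phone
  - match_severity: [critical, warning]   # match_kind omitted: all kinds
    notifier: mail
  - match_kind: [RunCompleted]            # match_severity omitted: all severities
    notifier: chat
```

Under that assumption, a critical StageFailed event matches the first two routes and is delivered to both phone and mail, since routes are evaluated independently.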
vetting
Shared pipeline defaults that apply to all profiles.
vetting.stages
Ordered list of stage names the pipeline walks. Default:
vetting.thresholds
Array of threshold rules evaluated against every /sensor batch.
Rules apply across all profiles — a 92 C CPU limit fails both a
2-minute quick run and a 12-hour soak.
| Field | Type | Description |
| --- | --- | --- |
| `stage` | string | Stage selector. `*` matches any stage; exact name (e.g. `PSU`) limits to that stage. |
| `kind` | string | Measurement kind to match: `temp`, `psu_volt`, `iperf`, `fio_p99_us`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`, `smart_attr`, `fan`. |
| `key` | string | Key selector. Glob-ish matching: `*` matches all, `cpu/*` matches keys starting with `cpu/`, exact string for specific keys. |
| `op` | string | Comparison operator (see table below). |
| `value` | float | Threshold limit. |
| `nominal` | float | Reference value, only used by `within_pct` (e.g. 12.0 for a +12 V rail). |
| `unit` | string | Display unit (e.g. C, V, Mbps). Informational only. |
| `severity` | string | `critical` = fail the run immediately. `warning` = record for the report only. |

Threshold operators:

| Operator | Pass condition | Typical use |
| --- | --- | --- |
| `lt` | `observed < value` | CPU temp < 92 C |
| `lte` | `observed <= value` | EDAC UE count <= 0 |
| `gt` | `observed > value` | (none) |
| `gte` | `observed >= value` | iperf throughput >= 900 Mbps |
| `within_pct` | `abs(observed - nominal) / nominal * 100 <= value` | +12 V rail within 5 % of 12.0 V |

Default thresholds (from deploy/vetting.example.yaml):
```yaml
thresholds:
  - { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 92, unit: C, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+12V", op: within_pct, value: 5, nominal: 12.0, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+5V", op: within_pct, value: 5, nominal: 5.0, severity: critical }
  - { stage: PSU, kind: psu_volt, key: "+3.3V", op: within_pct, value: 5, nominal: 3.3, severity: critical }
  - { stage: Storage, kind: fio_p99_us, key: "*", op: lt, value: 50000, severity: warning }
  - { stage: Network, kind: iperf, key: throughput_mbps, op: gte, value: 900, severity: critical }
  - { stage: Network, kind: nic_retrans, key: "*/rate", op: lt, value: 0.001, severity: warning }
  - { stage: CPUStress, kind: edac_ue, key: "*", op: lte, value: 0, severity: critical }
  - { stage: CPUStress, kind: mce, key: "*", op: lte, value: 0, severity: critical }
```
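
As a worked reading of within_pct: with nominal 12.0 and value 5, a reading passes when it deviates from 12.0 V by at most 0.6 V, that is, when it lies between 11.4 V and 12.6 V. The same 5 % gives 4.75 V to 5.25 V on the +5 V rail and 3.135 V to 3.465 V on the +3.3 V rail. The sketch below shows two extra rules one might add alongside the defaults; the limits (85 and 3) are illustrative assumptions, not recommended values.

```yaml
# Sketch only: limits are illustrative.
thresholds:
  # Early warning below the critical 92 C limit; recorded in the report only.
  - { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 85, unit: C, severity: warning }
  # Tighter +12 V check: 3 % of 12.0 V is 0.36 V, so 11.64 V to 12.36 V passes.
  - { stage: PSU, kind: psu_volt, key: "+12V", op: within_pct, value: 3, nominal: 12.0, unit: V, severity: warning }
```

Because these rules use severity warning, a violation is recorded for the report without failing the run; the critical defaults still decide pass or fail.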
profiles
Three built-in profiles control per-stage durations and probe knobs.
Every profile exercises every probe and gate — only the durations
scale. Quick is a ~10-minute same-day sanity check; deep is the
8-12 hour overnight soak; soak is the opt-in 36-40 hour extreme run.
Profile inheritance
A profile can declare inherit: <parent> to merge the parent's
timeouts and defaults before applying its own overrides. Child keys
win. The default soak profile inherits from deep.
stage_timeouts
Per-stage time limits. The orchestrator kills the agent's stage
subprocess when a timeout fires.
| Stage | quick | deep | soak |
| --- | --- | --- | --- |
| CPUStress | 5 m | 2 h | 14 h |
| Storage | 5 m | 4 h | 8 h |
| Network | 2 m | 35 m | 2 h 30 m |
| Burn | 3 m | 3 h | 20 h |
| PSU | 1 m | 10 m | 15 m |

defaults
Per-stage probe knobs shipped to the agent on /claim. Empty values
mean "fall back to the agent's compile-time default".
cpustress

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `cpu_pass` | duration | `stress-ng --cpu` duration | 2 m | 60 m | 12 h |
| `mem_pass` | duration | `stress-ng --vm` duration | 2 m | 60 m | (inherit) |
| `edac_poll` | duration | EDAC error counter polling interval | 10 s | 10 s | (inherit) |

storage

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `mode` | string | `fio_sample` (skip badblocks) or `full_disk` (badblocks + fio) | fio_sample | full_disk | full_disk |
| `fio_size` | string | fio test file size (only in `fio_sample` mode) | 1 GiB | (inherit) | (inherit) |
| `fio_time` | duration | fio runtime | 3 m | 2 h | 6 h |
| `fio_bs` | string | fio block size | 4 k | 4 k | (inherit) |
| `fio_rw` | string | fio I/O pattern | randrw | randrw | (inherit) |
| `verify` | string | fio integrity mode (`md5` or empty) | md5 | md5 | (inherit) |

network

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `duration` | duration | iperf3 test duration | 60 s | 30 m | 2 h |

burn

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `duration` | duration | Total burn-in window (CPU + mem + disk + net simultaneously) | 2 m | 2 h | 18 h |
| `cpu_workers` | string | `all` (= `runtime.NumCPU()`) or a numeric string | all | all | (inherit) |
| `mem_pct` | int | Percentage of MemAvailable to stress | 50 | 70 | (inherit) |
| `fio_on_spare` | bool | Run fio inside Burn (requires a spare partition) | true | true | (inherit) |
| `iperf_parallel` | int | Parallel stream count fed to `iperf3 -P` | 2 | 4 | 8 |

Example profile block
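
A minimal sketch of a custom profile that inherits from deep and overrides only the Burn stage. Several things here are assumptions: that profiles are keyed by name under a top-level `profiles:` mapping, that stage_timeouts is keyed by the stage names from the table above, that durations are written Go-style (`4h`, `30m`), and the profile name `overnight` itself.

```yaml
# Sketch only: layout, duration syntax and the profile name are assumed.
profiles:
  overnight:                # hypothetical profile name
    inherit: deep           # merge deep's timeouts and defaults first; child keys win
    stage_timeouts:
      Burn: 5h              # deep allows 3 h
    defaults:
      burn:
        duration: 4h        # deep uses 2 h; kept below the 5 h stage timeout
        mem_pct: 80         # deep uses 70
        iperf_parallel: 8   # deep uses 4
```

Keys that are not listed (cpu_workers, fio_on_spare, and every other stage) would come from deep through inheritance.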
Host-mode agent config
The persistent host-mode agent reads a separate file at
/etc/vetting/host-agent.yaml. The quick-register one-liner installs
this file; it is distinct from the orchestrator config.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `orchestrator_url` | string | (required) | URL of the orchestrator (e.g. `http://192.168.1.135:8080`). |
| `mac` | string | (auto-detected) | MAC address to heartbeat as. Auto-detected from the default route NIC if omitted. |
| `interval` | duration | `30s` | Heartbeat interval. |
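
A sketch of a host-agent file using the three keys above, assuming they sit at the top level of the file. The MAC address is a made-up placeholder and is left commented out so that auto-detection applies.

```yaml
# /etc/vetting/host-agent.yaml (sketch only)
orchestrator_url: "http://192.168.1.135:8080"   # required
# mac: "aa:bb:cc:dd:ee:ff"   # placeholder; omit to auto-detect from the default route NIC
interval: 30s                # heartbeat interval (the default)
```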