Configuration reference

The orchestrator reads a single YAML file at startup. Production installs use /etc/vetting/vetting.yaml; the dev default is deploy/vetting.example.yaml. Pass the path with --config:

vetting --config /etc/vetting/vetting.yaml

Every key has a compile-time default (see internal/config/config.go), so an empty file produces a working orchestrator bound to 127.0.0.1:8080 with PXE disabled.


server

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `bind` | string | `127.0.0.1:8080` | Address and port the HTTP server listens on. |
| `public_url` | string | (empty) | External URL the orchestrator is reachable at from a browser. Used in notification click-throughs (e.g. `https://vetting.lan:8443`). |
| `tls.enabled` | bool | `false` | Terminate TLS at the orchestrator. |
| `tls.cert_file` | string | (empty) | Path to the PEM-encoded certificate. |
| `tls.key_file` | string | (empty) | Path to the PEM-encoded private key. |
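
As an illustration, a `server` block that terminates TLS at the orchestrator might look like the fragment below. This is a sketch, not a shipped default: it assumes the dotted `tls.*` keys nest under a `tls:` map (the same way `smtp.*` nests in the SMTP notifier example), and the bind address and file paths are placeholders.

```yaml
server:
  bind: 0.0.0.0:8443                       # placeholder: all interfaces, port 8443
  public_url: https://vetting.lan:8443     # URL used in notification click-throughs
  tls:
    enabled: true
    cert_file: /etc/vetting/tls/cert.pem   # hypothetical path to the PEM certificate
    key_file: /etc/vetting/tls/key.pem     # hypothetical path to the PEM private key
```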

database

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `path` | string | `./var/vetting.db` | SQLite database file. Created on first run. |

artifacts

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dir` | string | `./var/artifacts` | Directory for per-run files (reports, fio logs, iperf logs, hold keys). |
| `retention_days` | int | `30` | Days to keep artifact files before the janitor prunes them. `0` = keep forever. DB rows are never pruned. |

logs

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `dir` | string | `./var/logs` | Directory for per-run append-only log files. |
| `retention_days` | int | `30` | Days to keep log files. `0` = keep forever. |

janitor

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `interval_minutes` | int | `60` | Minutes between cleanup sweeps. A value of `0` falls back to `60`. |
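
The storage-related sections (`database`, `artifacts`, `logs`, `janitor`) are usually set together. The fragment below is a sketch of how they might be combined on a production host. All paths and retention values are illustrative assumptions, not defaults, and it assumes each section is a top-level key in the file.

```yaml
database:
  path: /var/lib/vetting/vetting.db    # hypothetical path; created on first run
artifacts:
  dir: /var/lib/vetting/artifacts      # hypothetical path
  retention_days: 90                   # prune artifact files after 90 days
logs:
  dir: /var/log/vetting                # hypothetical path
  retention_days: 0                    # 0 = keep log files forever
janitor:
  interval_minutes: 360                # sweep every 6 hours
```

With these values the janitor would only ever remove artifact files, since log retention is disabled and DB rows are never pruned.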

dispatcher

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `max_concurrent_runs` | int | `3` | Semaphore limiting how many vetting runs execute in parallel. |

network

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `iperf_port` | int | `5201` | Port the orchestrator-supervised `iperf3 -s` binds to. The agent connects here during the Network stage. |

pxe

PXE is disabled by default. Enable it after running vetting-pxe-setup.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | bool | `false` | Enable dnsmasq + iPXE serving. |
| `interface` | string | (empty) | LAN NIC the dnsmasq proxy-DHCP binds to (e.g. `eth0`). |
| `subnet` | string | (empty) | LAN CIDR (e.g. `192.168.1.0/24`). Scopes the proxy-DHCP responses. |
| `orchestrator_url` | string | (empty) | URL the live-image agent uses to reach the orchestrator (e.g. `http://192.168.1.135:8080`). Baked into the iPXE kernel cmdline. |
| `tftp_root` | string | (empty) | Directory containing `ipxe.efi` + `undionly.kpxe`. |
| `live_dir` | string | (empty) | Directory containing `vmlinuz` + `initrd.img`. Served at `/live/*`. |

dnsmasq runs in proxy-DHCP mode: it coexists with your existing router's DHCP server and only supplements PXE options. See operations.md for the full setup walkthrough.
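
A filled-in `pxe` block might look like the sketch below. The interface, subnet, and orchestrator URL reuse the example values from the table above; the two directory paths are hypothetical placeholders for wherever `vetting-pxe-setup` placed the boot files on your host.

```yaml
pxe:
  enabled: true
  interface: eth0                               # LAN NIC for proxy-DHCP
  subnet: 192.168.1.0/24                        # scopes the proxy-DHCP responses
  orchestrator_url: http://192.168.1.135:8080   # baked into the iPXE kernel cmdline
  tftp_root: /var/lib/vetting/tftp              # hypothetical: holds ipxe.efi + undionly.kpxe
  live_dir: /var/lib/vetting/live               # hypothetical: holds vmlinuz + initrd.img
```

Because the live-image agent reaches the orchestrator over the LAN, `server.bind` presumably has to listen on an address that matches `orchestrator_url`; the loopback default `127.0.0.1:8080` is not reachable from other machines.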

agent

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `asset_dir` | string | `<database.dir>/../assets` | Directory containing `vetting-agent-linux-amd64`. Served at `/assets/*` so the quick-register one-liner can download the agent binary. Empty string disables the route. |

notifiers

An array of notification targets. Each entry declares a named notifier with a type-specific set of fields. Delivery is fire-and-forget (one attempt per event, 10 s timeout, failures logged).

ntfy

notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `ntfy` |
| `server` | string | ntfy server URL. |
| `topic` | string | Topic to publish to. |

Discord

notifiers:
  - name: ops-discord
    type: discord
    webhook_url: https://discord.com/api/webhooks/XXX/YYY

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `discord` |
| `webhook_url` | string | Discord webhook URL. |

SMTP

notifiers:
  - name: ops-email
    type: smtp
    smtp:
      host: mail.lan
      port: 25
      from: vetting@lan.local
      to: [ops@lan.local]

| Field | Type | Description |
| --- | --- | --- |
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `smtp` |
| `smtp.host` | string | SMTP server hostname. |
| `smtp.port` | int | SMTP server port. |
| `smtp.from` | string | Sender address. |
| `smtp.to` | string[] | Recipient addresses. |

routes

Routes map notification events to notifiers by kind and severity. Each route is evaluated independently; an event can match multiple routes and fire on multiple notifiers.

routes:
  - match_severity: [critical]
    notifier: ops-ntfy
  - match_severity: [critical]
    notifier: ops-discord
  - match_kind: [RunCompleted]
    notifier: ops-ntfy

| Field | Type | Description |
| --- | --- | --- |
| `match_kind` | string[] | Event kinds to match: `StageFailed`, `SpecMismatch`, `HoldingOpened`, `RunCompleted`. Omit to match all kinds. |
| `match_severity` | string[] | Severities to match: `critical`, `warning`, `info`. Omit to match all severities. |
| `notifier` | string | Name of a declared notifier to deliver to. |
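
The two filters can also appear on the same route. The sketch below assumes that a route setting both `match_kind` and `match_severity` requires both to match; this reference describes matching "by kind and severity" but does not spell out how the two combine, so treat that as an assumption to verify. The notifier names are the ones declared in the examples above.

```yaml
routes:
  # Assumed AND semantics: only critical StageFailed or SpecMismatch events.
  - match_kind: [StageFailed, SpecMismatch]
    match_severity: [critical]
    notifier: ops-email
  # Neither filter is set, so every kind and every severity matches.
  - notifier: ops-ntfy
```

Since routes are evaluated independently, a critical `StageFailed` event would be delivered to both `ops-email` and `ops-ntfy` under this configuration.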

vetting

Shared pipeline defaults that apply to all profiles.

vetting.stages

Ordered list of stage names the pipeline walks. Default:

vetting:
  stages:
    - Inventory
    - Firmware
    - SpecValidate
    - SMART
    - CPUStress
    - Storage
    - Network
    - Burn
    - GPU
    - PSU
    - Reporting

vetting.thresholds

Array of threshold rules evaluated against every /sensor batch. Rules apply across all profiles — a 92 C CPU limit fails both a 2-minute quick run and a 12-hour soak.

| Field | Type | Description |
| --- | --- | --- |
| `stage` | string | Stage selector. `*` matches any stage; exact name (e.g. `PSU`) limits to that stage. |
| `kind` | string | Measurement kind to match: `temp`, `psu_volt`, `iperf`, `fio_p99_us`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`, `smart_attr`, `fan`. |
| `key` | string | Key selector. Glob-ish matching: `*` matches all, `cpu/*` matches keys starting with `cpu/`, exact string for specific keys. |
| `op` | string | Comparison operator (see table below). |
| `value` | float | Threshold limit. |
| `nominal` | float | Reference value, only used by `within_pct` (e.g. `12.0` for a +12 V rail). |
| `unit` | string | Display unit (e.g. `C`, `V`, `Mbps`). Informational only. |
| `severity` | string | `critical` = fail the run immediately. `warning` = record for the report only. |

Threshold operators:

| Operator | Pass condition | Typical use |
| --- | --- | --- |
| `lt` | `observed < value` | CPU temp < 92 C |
| `lte` | `observed <= value` | EDAC UE count <= 0 |
| `gt` | `observed > value` | |
| `gte` | `observed >= value` | iperf throughput >= 900 Mbps |
| `within_pct` | `abs(observed - nominal) / nominal * 100 <= value` | +12 V rail within 5 % of 12.0 V |
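
To make the pass conditions concrete, here are two rules written out in block form with the resulting pass range noted in comments. This is a sketch: it assumes `thresholds` nests under the top-level `vetting:` key (as the `vetting.thresholds` heading suggests), and the `unit: V` on the first rule is added for illustration only, since `unit` is informational.

```yaml
vetting:
  thresholds:
    # Passes while abs(observed - 12.0) / 12.0 * 100 <= 5,
    # i.e. while the rail reads between 11.4 V and 12.6 V inclusive.
    - stage: PSU
      kind: psu_volt
      key: "+12V"
      op: within_pct
      value: 5
      nominal: 12.0
      unit: V
      severity: critical
    # "cpu/*" selects every key starting with cpu/; passes while observed < 92.
    - stage: "*"
      kind: temp
      key: "cpu/*"
      op: lt
      value: 92
      unit: C
      severity: critical
```

Note that `lt` is strict: a reading of exactly 92 fails the second rule, whereas a reading of exactly 12.6 V still passes the first.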

Default thresholds (from deploy/vetting.example.yaml):

thresholds:
  - { stage: "*",       kind: temp,        key: "cpu/*",         op: lt,         value: 92,    unit: C,    severity: critical }
  - { stage: PSU,       kind: psu_volt,    key: "+12V",          op: within_pct, value: 5,     nominal: 12.0, severity: critical }
  - { stage: PSU,       kind: psu_volt,    key: "+5V",           op: within_pct, value: 5,     nominal: 5.0,  severity: critical }
  - { stage: PSU,       kind: psu_volt,    key: "+3.3V",         op: within_pct, value: 5,     nominal: 3.3,  severity: critical }
  - { stage: Storage,   kind: fio_p99_us,  key: "*",             op: lt,         value: 50000, severity: warning  }
  - { stage: Network,   kind: iperf,       key: throughput_mbps, op: gte,        value: 900,   severity: critical }
  - { stage: Network,   kind: nic_retrans, key: "*/rate",        op: lt,         value: 0.001, severity: warning  }
  - { stage: CPUStress, kind: edac_ue,     key: "*",             op: lte,        value: 0,     severity: critical }
  - { stage: CPUStress, kind: mce,         key: "*",             op: lte,        value: 0,     severity: critical }

profiles

Three built-in profiles control per-stage durations and probe knobs. Every profile exercises every probe and gate; only the durations scale. quick is a ~10-minute same-day sanity check, deep is the 8-12 hour overnight run, and soak is the opt-in 36-40 hour extreme run.

Profile inheritance

A profile can declare inherit: <parent> to merge the parent's timeouts and defaults before applying its own overrides. Child keys win. The default soak profile inherits from deep.

stage_timeouts

Per-stage time limits. The orchestrator kills the agent's stage subprocess when a timeout fires.

| Stage | quick | deep | soak |
| --- | --- | --- | --- |
| CPUStress | 5 m | 2 h | 14 h |
| Storage | 5 m | 4 h | 8 h |
| Network | 2 m | 35 m | 2 h 30 m |
| Burn | 3 m | 3 h | 20 h |
| PSU | 1 m | 10 m | 15 m |

defaults

Per-stage probe knobs shipped to the agent on /claim. Empty values mean "fall back to the agent's compile-time default".

cpustress

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `cpu_pass` | duration | `stress-ng --cpu` duration | 2 m | 60 m | 12 h |
| `mem_pass` | duration | `stress-ng --vm` duration | 2 m | 60 m | (inherit) |
| `edac_poll` | duration | EDAC error counter polling interval | 10 s | 10 s | (inherit) |

storage

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `mode` | string | `fio_sample` (skip badblocks) or `full_disk` (badblocks + fio) | fio_sample | full_disk | full_disk |
| `fio_size` | string | fio test file size (only in `fio_sample` mode) | 1 GiB | (inherit) | (inherit) |
| `fio_time` | duration | fio runtime | 3 m | 2 h | 6 h |
| `fio_bs` | string | fio block size | 4 k | 4 k | (inherit) |
| `fio_rw` | string | fio I/O pattern | randrw | randrw | (inherit) |
| `verify` | string | fio integrity mode (`md5` or empty) | md5 | md5 | (inherit) |

network

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `duration` | duration | iperf3 test duration | 60 s | 30 m | 2 h |

burn

| Knob | Type | Description | quick | deep | soak |
| --- | --- | --- | --- | --- | --- |
| `duration` | duration | Total burn-in window (CPU + mem + disk + net simultaneously) | 2 m | 2 h | 18 h |
| `cpu_workers` | string | `all` (= `runtime.NumCPU()`) or a numeric string | all | all | (inherit) |
| `mem_pct` | int | Percentage of MemAvailable to stress | 50 | 70 | (inherit) |
| `fio_on_spare` | bool | Run fio inside Burn (requires a spare partition) | true | true | (inherit) |
| `iperf_parallel` | int | Parallel stream count fed to `iperf3 -P` | 2 | 4 | 8 |

Example profile block

profiles:
  quick:
    stage_timeouts:
      CPUStress: 5m
      Storage:   5m
      Network:   2m
    defaults:
      cpustress: { cpu_pass: 2m, mem_pass: 2m, edac_poll: 10s }
      storage:   { mode: fio_sample, fio_size: 1GiB, fio_time: 3m, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 60s }
      burn:      { duration: 2m, cpu_workers: all, mem_pct: 50, fio_on_spare: true, iperf_parallel: 2 }
  deep:
    stage_timeouts:
      CPUStress: 2h
      Storage:   4h
      Network:   35m
    defaults:
      cpustress: { cpu_pass: 60m, mem_pass: 60m, edac_poll: 10s }
      storage:   { mode: full_disk, fio_time: 2h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 30m }
      burn:      { duration: 2h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 4 }
  soak:
    inherit: deep
    stage_timeouts:
      CPUStress: 14h
      Storage:   8h
      Network:   2h30m
    defaults:
      cpustress: { cpu_pass: 12h }
      storage:   { mode: full_disk, fio_time: 6h }
      network:   { duration: 2h }
      burn:      { duration: 18h, iperf_parallel: 8 }
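
To make the `inherit` merge concrete, the soak block above, once resolved against deep, would be expected to behave like the fully spelled-out profile below. This is an illustrative sketch rather than something you need to write: it assumes the merge is key by key within each stage's map, which is what the (inherit) cells in the knob tables imply.

```yaml
profiles:
  soak:
    stage_timeouts:
      CPUStress: 14h      # soak override
      Storage:   8h       # soak override
      Network:   2h30m    # soak override
    defaults:
      # cpu_pass from soak; mem_pass and edac_poll carried over from deep
      cpustress: { cpu_pass: 12h, mem_pass: 60m, edac_poll: 10s }
      # fio_time from soak; fio_bs, fio_rw, verify carried over from deep
      storage:   { mode: full_disk, fio_time: 6h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 2h }
      # duration and iperf_parallel from soak; the rest carried over from deep
      burn:      { duration: 18h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 8 }
```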

Host-mode agent config

The persistent host-mode agent reads a separate file at /etc/vetting/host-agent.yaml. The quick-register one-liner installs this file, which is distinct from the orchestrator config.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `orchestrator_url` | string | (required) | URL of the orchestrator (e.g. `http://192.168.1.135:8080`). |
| `mac` | string | (auto-detected) | MAC address to heartbeat as. Auto-detected from the default route NIC if omitted. |
| `interval` | duration | `30s` | Heartbeat interval. |
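
A complete host-agent file could be as small as the sketch below. It assumes the three keys sit at the top level of the file, as the table suggests; the MAC address is a placeholder.

```yaml
# /etc/vetting/host-agent.yaml
orchestrator_url: http://192.168.1.135:8080   # required
mac: "aa:bb:cc:dd:ee:ff"                      # placeholder; omit to auto-detect from the default-route NIC
interval: 30s                                 # heartbeat interval (the default)
```

The MAC is quoted so that the YAML parser always reads it as a string, whatever digits it contains.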