# Configuration reference

The orchestrator reads a single YAML file at startup. Production
installs use `/etc/vetting/vetting.yaml`; the dev default is
`deploy/vetting.example.yaml`. Pass the path with `--config`:

```
vetting --config /etc/vetting/vetting.yaml
```

Every key has a compile-time default (see `internal/config/config.go`),
so an empty file produces a working orchestrator bound to
`127.0.0.1:8080` with PXE disabled.

---

## `server`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `bind` | string | `127.0.0.1:8080` | Address and port the HTTP server listens on. |
| `public_url` | string | *(empty)* | External URL at which browsers reach the orchestrator (e.g. `https://vetting.lan:8443`). Used to build notification click-through links. |
| `tls.enabled` | bool | `false` | Terminate TLS at the orchestrator. |
| `tls.cert_file` | string | *(empty)* | Path to the PEM-encoded certificate. |
| `tls.key_file` | string | *(empty)* | Path to the PEM-encoded private key. |
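
As a sketch of how these keys nest in the file, the fragment below enables TLS on a LAN-facing listener. The bind address and certificate paths are illustrative assumptions, not shipped defaults, and the `tls.*` keys are assumed to nest under a `tls:` map in the same way the `smtp.*` fields nest under `smtp:` in the notifier examples.

```yaml
server:
  bind: 0.0.0.0:8443                        # assumed: listen on all interfaces
  public_url: https://vetting.lan:8443      # example value from the table above
  tls:
    enabled: true
    cert_file: /etc/vetting/tls/server.crt  # hypothetical path
    key_file: /etc/vetting/tls/server.key   # hypothetical path
```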

## `database`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `path` | string | `./var/vetting.db` | SQLite database file. Created on first run. |

## `artifacts`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/artifacts` | Directory for per-run files (reports, fio logs, iperf logs, hold keys). |
| `retention_days` | int | `30` | Days to keep artifact files before the janitor prunes them. `0` = keep forever. DB rows are never pruned. |

## `logs`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `dir` | string | `./var/logs` | Directory for per-run append-only log files. |
| `retention_days` | int | `30` | Days to keep log files. `0` = keep forever. |

## `janitor`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `interval_minutes` | int | `60` | Minutes between cleanup sweeps. `0` falls back to `60`. |
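
The `database`, `artifacts`, `logs`, and `janitor` sections are often tuned together. One possible production layout is sketched below; every path and retention value is an assumption chosen for illustration only.

```yaml
database:
  path: /var/lib/vetting/vetting.db   # hypothetical path
artifacts:
  dir: /var/lib/vetting/artifacts     # hypothetical path
  retention_days: 90                  # keep reports for roughly a quarter
logs:
  dir: /var/log/vetting               # hypothetical path
  retention_days: 14
janitor:
  interval_minutes: 30                # sweep twice an hour
```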

## `dispatcher`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `max_concurrent_runs` | int | `3` | Maximum number of vetting runs that execute in parallel (enforced with a semaphore). |

## `network`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `iperf_port` | int | `5201` | Port the orchestrator-supervised `iperf3 -s` binds to. The agent connects here during the Network stage. |

## `pxe`

PXE is disabled by default. Enable it after running
[`vetting-pxe-setup`](operations.md#pxe-enablement).

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | bool | `false` | Enable dnsmasq + iPXE serving. |
| `interface` | string | *(empty)* | LAN NIC the dnsmasq proxy-DHCP binds to (e.g. `eth0`). |
| `subnet` | string | *(empty)* | LAN CIDR (e.g. `192.168.1.0/24`). Scopes the proxy-DHCP responses. |
| `orchestrator_url` | string | *(empty)* | URL the live-image agent uses to reach the orchestrator (e.g. `http://192.168.1.135:8080`). Baked into the iPXE kernel cmdline. |
| `tftp_root` | string | *(empty)* | Directory containing `ipxe.efi` + `undionly.kpxe`. |
| `live_dir` | string | *(empty)* | Directory containing `vmlinuz` + `initrd.img`. Served at `/live/*`. |

dnsmasq runs in **proxy-DHCP mode**: it coexists with your existing
router's DHCP server and only supplements PXE options. See
[operations.md](operations.md#pxe-enablement) for the full setup
walkthrough.
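
For orientation, a filled-in `pxe` section might look like the sketch below. The interface, subnet, and URL reuse the example values from the table; the two directories are hypothetical and depend on where `vetting-pxe-setup` placed the files on your host.

```yaml
pxe:
  enabled: true
  interface: eth0
  subnet: 192.168.1.0/24
  orchestrator_url: http://192.168.1.135:8080
  tftp_root: /srv/vetting/tftp   # hypothetical; holds ipxe.efi and undionly.kpxe
  live_dir: /srv/vetting/live    # hypothetical; holds vmlinuz and initrd.img
```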

## `agent`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `asset_dir` | string | `<dir of database.path>/../assets` | Directory containing `vetting-agent-linux-amd64`. Served at `/assets/*` so the quick-register one-liner can download the agent binary. Empty string disables the route. |

## `notifiers`

An array of notification targets. Each entry declares a named notifier
with a type-specific set of fields. Delivery is fire-and-forget (one
attempt per event, 10 s timeout, failures logged).

### ntfy

```yaml
notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `ntfy` |
| `server` | string | ntfy server URL. |
| `topic` | string | Topic to publish to. |

### Discord

```yaml
notifiers:
  - name: ops-discord
    type: discord
    webhook_url: https://discord.com/api/webhooks/XXX/YYY
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `discord` |
| `webhook_url` | string | Discord webhook URL. |

### SMTP

```yaml
notifiers:
  - name: ops-email
    type: smtp
    smtp:
      host: mail.lan
      port: 25
      from: vetting@lan.local
      to: [ops@lan.local]
```

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Identifier referenced by `routes[].notifier`. |
| `type` | string | `smtp` |
| `smtp.host` | string | SMTP server hostname. |
| `smtp.port` | int | SMTP server port. |
| `smtp.from` | string | Sender address. |
| `smtp.to` | string[] | Recipient addresses. |

## `routes`

Routes map notification events to notifiers by kind and severity.
Each route is evaluated independently; an event can match multiple
routes and fire on multiple notifiers.

```yaml
routes:
  - match_severity: [critical]
    notifier: ops-ntfy
  - match_severity: [critical]
    notifier: ops-discord
  - match_kind: [RunCompleted]
    notifier: ops-ntfy
```

| Field | Type | Description |
|-------|------|-------------|
| `match_kind` | string[] | Event kinds to match: `StageFailed`, `SpecMismatch`, `HoldingOpened`, `RunCompleted`. Omit to match all kinds. |
| `match_severity` | string[] | Severities to match: `critical`, `warning`, `info`. Omit to match all severities. |
| `notifier` | string | Name of a declared notifier to deliver to. |
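
Putting `notifiers` and `routes` together: the sketch below declares two notifiers in a single `notifiers` list and routes events to them by name. The topic and addresses are the placeholders from the examples above; the routing policy itself is only an illustration.

```yaml
notifiers:
  - name: ops-ntfy
    type: ntfy
    server: https://ntfy.sh
    topic: vetting-YOUR-TOPIC
  - name: ops-email
    type: smtp
    smtp:
      host: mail.lan
      port: 25
      from: vetting@lan.local
      to: [ops@lan.local]

routes:
  # Critical events of any kind go to the push notifier.
  - match_severity: [critical]
    notifier: ops-ntfy
  # Stage failures and spec mismatches are mailed at any severity.
  - match_kind: [StageFailed, SpecMismatch]
    notifier: ops-email
```

Because routes are evaluated independently, a `StageFailed` event that carries severity `critical` would match both routes and be delivered to `ops-ntfy` and `ops-email`.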

## `vetting`

Shared pipeline defaults that apply to all profiles.

### `vetting.stages`

Ordered list of stage names the pipeline walks. Default:

```yaml
vetting:
  stages:
    - Inventory
    - Firmware
    - SpecValidate
    - SMART
    - CPUStress
    - Storage
    - Network
    - Burn
    - GPU
    - PSU
    - Reporting
```
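
Since the list is configurable, a site could presumably drop a stage that does not apply to its hardware. The sketch below omits `GPU` for machines without a discrete card. Whether the orchestrator accepts arbitrary subsets or reorderings is not stated here, so treat this as an assumption to verify against your build.

```yaml
vetting:
  stages:
    - Inventory
    - Firmware
    - SpecValidate
    - SMART
    - CPUStress
    - Storage
    - Network
    - Burn
    # GPU omitted (assumption: stages left out of the list are not run)
    - PSU
    - Reporting
```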

### `vetting.thresholds`

Array of threshold rules evaluated against every `/sensor` batch.
Rules apply across all profiles: a 92 C CPU limit applies equally to
a 2-minute quick run and a 12-hour soak.

| Field | Type | Description |
|-------|------|-------------|
| `stage` | string | Stage selector. `*` matches any stage; an exact name (e.g. `PSU`) limits the rule to that stage. |
| `kind` | string | Measurement kind to match: `temp`, `psu_volt`, `iperf`, `fio_p99_us`, `nic_retrans`, `edac_ue`, `edac_ce`, `mce`, `smart_attr`, `fan`. |
| `key` | string | Key selector with simple glob-style matching: `*` matches every key, `cpu/*` matches keys starting with `cpu/`, and a string without `*` matches that key exactly. The shipped defaults also use a leading wildcard (`*/rate`). |
| `op` | string | Comparison operator (see table below). |
| `value` | float | Threshold limit. |
| `nominal` | float | Reference value, only used by `within_pct` (e.g. `12.0` for a +12 V rail). |
| `unit` | string | Display unit (e.g. `C`, `V`, `Mbps`). Informational only. |
| `severity` | string | `critical` = fail the run immediately. `warning` = record for the report only. |

**Threshold operators:**

| Operator | Pass condition | Typical use |
|----------|---------------|-------------|
| `lt` | `observed < value` | CPU temp < 92 C |
| `lte` | `observed <= value` | EDAC UE count <= 0 |
| `gt` | `observed > value` | — |
| `gte` | `observed >= value` | iperf throughput >= 900 Mbps |
| `within_pct` | `abs(observed - nominal) / nominal * 100 <= value` | +12 V rail within 5 % of 12.0 V |
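
To make the `within_pct` arithmetic concrete, the fragment below repeats the three default rail rules and works out the accepted band for `value: 5` in comments. The rules are copied from the defaults in this section; only the comments are added, and the list is shown nested under `vetting` as in the `vetting.stages` example.

```yaml
vetting:
  thresholds:
    # 12.0 V * 5 % = 0.6 V    -> passes from 11.4 V to 12.6 V (inclusive)
    - { stage: PSU, kind: psu_volt, key: "+12V",  op: within_pct, value: 5, nominal: 12.0, severity: critical }
    # 5.0 V * 5 % = 0.25 V    -> passes from 4.75 V to 5.25 V (inclusive)
    - { stage: PSU, kind: psu_volt, key: "+5V",   op: within_pct, value: 5, nominal: 5.0,  severity: critical }
    # 3.3 V * 5 % = 0.165 V   -> passes from 3.135 V to 3.465 V (inclusive)
    - { stage: PSU, kind: psu_volt, key: "+3.3V", op: within_pct, value: 5, nominal: 3.3,  severity: critical }
```

For example, a +12 V reading of 11.3 V deviates by 0.7 / 12.0 * 100 ≈ 5.83 %, which exceeds 5 and fails the rule.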

**Default thresholds** (from `deploy/vetting.example.yaml`):

```yaml
vetting:
  thresholds:
    - { stage: "*",       kind: temp,        key: "cpu/*",         op: lt,         value: 92,    unit: C,    severity: critical }
    - { stage: PSU,       kind: psu_volt,    key: "+12V",          op: within_pct, value: 5,     nominal: 12.0, severity: critical }
    - { stage: PSU,       kind: psu_volt,    key: "+5V",           op: within_pct, value: 5,     nominal: 5.0,  severity: critical }
    - { stage: PSU,       kind: psu_volt,    key: "+3.3V",         op: within_pct, value: 5,     nominal: 3.3,  severity: critical }
    - { stage: Storage,   kind: fio_p99_us,  key: "*",             op: lt,         value: 50000, severity: warning  }
    - { stage: Network,   kind: iperf,       key: throughput_mbps, op: gte,        value: 900,   severity: critical }
    - { stage: Network,   kind: nic_retrans, key: "*/rate",        op: lt,         value: 0.001, severity: warning  }
    - { stage: CPUStress, kind: edac_ue,     key: "*",             op: lte,        value: 0,     severity: critical }
    - { stage: CPUStress, kind: mce,         key: "*",             op: lte,        value: 0,     severity: critical }
```
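
Because `severity` separates hard failures from report-only findings, one plausible pattern is to pair a `critical` limit with a lower `warning` limit on the same measurement. The 85 C figure below is an arbitrary illustration, not a shipped default, and whether a user-supplied list replaces or extends the defaults is not specified here.

```yaml
vetting:
  thresholds:
    # Report-only early warning (hypothetical value).
    - { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 85, unit: C, severity: warning }
    # Hard limit, same as the shipped default.
    - { stage: "*", kind: temp, key: "cpu/*", op: lt, value: 92, unit: C, severity: critical }
```

With these two rules, a CPU reading of 88 C would fail the `warning` rule (recorded in the report) while still passing the `critical` one.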

## `profiles`

Three built-in profiles control per-stage durations and probe knobs.
Every profile exercises every probe and gate; only the durations
scale. `quick` is a ~10-minute same-day sanity check, `deep` is the
8-12 hour overnight run, and `soak` is the opt-in 36-40 hour extreme
run.

### Profile inheritance

A profile can declare `inherit: <parent>` to merge the parent's
timeouts and defaults before applying its own overrides. Child keys
win. The default `soak` profile inherits from `deep`.

### `stage_timeouts`

Per-stage time limits. The orchestrator kills the agent's stage
subprocess when a timeout fires.

| Stage | quick | deep | soak |
|-------|-------|------|------|
| CPUStress | 5 m | 2 h | 14 h |
| Storage | 5 m | 4 h | 8 h |
| Network | 2 m | 35 m | 2 h 30 m |
| Burn | 3 m | 3 h | 20 h |
| PSU | 1 m | 10 m | 15 m |

### `defaults`

Per-stage probe knobs shipped to the agent on `/claim`. Empty values
mean "fall back to the agent's compile-time default".

#### `cpustress`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `cpu_pass` | duration | `stress-ng --cpu` duration | 2 m | 60 m | 12 h |
| `mem_pass` | duration | `stress-ng --vm` duration | 2 m | 60 m | *(inherit)* |
| `edac_poll` | duration | EDAC error counter polling interval | 10 s | 10 s | *(inherit)* |

#### `storage`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `mode` | string | `fio_sample` (skip badblocks) or `full_disk` (badblocks + fio) | fio_sample | full_disk | full_disk |
| `fio_size` | string | fio test file size (only in `fio_sample` mode) | 1 GiB | *(empty)* | *(inherit)* |
| `fio_time` | duration | fio runtime | 3 m | 2 h | 6 h |
| `fio_bs` | string | fio block size | 4 k | 4 k | *(inherit)* |
| `fio_rw` | string | fio I/O pattern | randrw | randrw | *(inherit)* |
| `verify` | string | fio integrity mode (`md5` or empty) | md5 | md5 | *(inherit)* |

#### `network`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | `iperf3` test duration | 60 s | 30 m | 2 h |

#### `burn`

| Knob | Type | Description | quick | deep | soak |
|------|------|-------------|-------|------|------|
| `duration` | duration | Total burn-in window (CPU + mem + disk + net simultaneously) | 2 m | 2 h | 18 h |
| `cpu_workers` | string | `all` (= `runtime.NumCPU()`) or a numeric string | all | all | *(inherit)* |
| `mem_pct` | int | Percentage of MemAvailable to stress | 50 | 70 | *(inherit)* |
| `fio_on_spare` | bool | Run fio inside Burn (requires a spare partition) | true | true | *(inherit)* |
| `iperf_parallel` | int | Parallel stream count fed to `iperf3 -P` | 2 | 4 | 8 |

### Example profile block

```yaml
profiles:
  quick:
    stage_timeouts:
      CPUStress: 5m
      Storage:   5m
      Network:   2m
    defaults:
      cpustress: { cpu_pass: 2m, mem_pass: 2m, edac_poll: 10s }
      storage:   { mode: fio_sample, fio_size: 1GiB, fio_time: 3m, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 60s }
      burn:      { duration: 2m, cpu_workers: all, mem_pct: 50, fio_on_spare: true, iperf_parallel: 2 }
  deep:
    stage_timeouts:
      CPUStress: 2h
      Storage:   4h
      Network:   35m
    defaults:
      cpustress: { cpu_pass: 60m, mem_pass: 60m, edac_poll: 10s }
      storage:   { mode: full_disk, fio_time: 2h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 30m }
      burn:      { duration: 2h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 4 }
  soak:
    inherit: deep
    stage_timeouts:
      CPUStress: 14h
      Storage:   8h
      Network:   2h30m
    defaults:
      cpustress: { cpu_pass: 12h }
      storage:   { mode: full_disk, fio_time: 6h }
      network:   { duration: 2h }
      burn:      { duration: 18h, iperf_parallel: 8 }
```
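
Working through the inheritance rule for the `soak` block above: start from `deep`, then let every key that `soak` sets win. Assuming the merge happens per knob, which is what the *(inherit)* cells in the tables suggest, the effective `soak` profile would be equivalent to the fully spelled-out block below. `inherit` is left out because nothing remains to be inherited.

```yaml
profiles:
  soak:
    stage_timeouts:
      CPUStress: 14h      # set by soak
      Storage:   8h       # set by soak
      Network:   2h30m    # set by soak
    defaults:
      # cpu_pass from soak; mem_pass and edac_poll carried over from deep
      cpustress: { cpu_pass: 12h, mem_pass: 60m, edac_poll: 10s }
      # mode and fio_time from soak; fio_bs, fio_rw, verify from deep
      storage:   { mode: full_disk, fio_time: 6h, fio_bs: 4k, fio_rw: randrw, verify: md5 }
      network:   { duration: 2h }
      # duration and iperf_parallel from soak; the rest from deep
      burn:      { duration: 18h, cpu_workers: all, mem_pct: 70, fio_on_spare: true, iperf_parallel: 8 }
```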

---

## Host-mode agent config

The persistent host-mode agent reads a separate file at
`/etc/vetting/host-agent.yaml`. The quick-register one-liner installs
this file; it is distinct from the orchestrator config.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `orchestrator_url` | string | *(required)* | URL of the orchestrator (e.g. `http://192.168.1.135:8080`). |
| `mac` | string | *(auto-detected)* | MAC address to heartbeat as. If omitted, it is auto-detected from the NIC that holds the default route. |
| `interval` | duration | `30s` | Heartbeat interval. |
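
A complete host-agent file is small. The sketch below assumes the three keys sit at the top level of the file, as the table suggests; the URL reuses the example above and the MAC address is a made-up placeholder.

```yaml
# /etc/vetting/host-agent.yaml
orchestrator_url: http://192.168.1.135:8080
mac: "00:11:22:33:44:55"   # hypothetical; quoted so it is always read as a string. Omit to auto-detect.
interval: 30s
```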