Files
josh 8367ec2a9f
CI / Lint + build + test (push) Successful in 1m36s
Release / detect (push) Successful in 5s
Release / build-live-image (push) Has been skipped
Release / bundle (push) Successful in 49s
docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API
reference with full request/response schemas, database schema), expand
the README with a feature list and how-it-works walkthrough, fix
missing Firmware and Burn stages in architecture.md and test-suite.md,
add threshold engine and host-mode agent sections, and add godoc
comments to 11 packages and 6 model types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 18:37:26 -04:00

280 lines
12 KiB
Markdown

# Database schema
The orchestrator uses SQLite via
[modernc.org/sqlite](https://pkg.go.dev/modernc.org/sqlite) — a pure
Go driver with no cgo dependency. The database file is created on
first startup at the path in `database.path`
(default `./var/vetting.db`).
**Pragmas set at open time:**
- `PRAGMA journal_mode = WAL` — write-ahead logging for concurrent
readers.
- `PRAGMA foreign_keys = ON` — enforced referential integrity.
**Migrations** are embedded via `go:embed` in `internal/db/` and
applied in filename order at startup. A `schema_migrations` table
tracks which migrations have run.
---
## Tables
### `hosts`
Registered hardware nodes in the vetting cluster.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `name` | TEXT | NOT NULL UNIQUE | | Human-readable host name. |
| `mac` | TEXT | NOT NULL UNIQUE | | Lowercase colon form (e.g. `aa:bb:cc:dd:ee:ff`). |
| `wol_broadcast_ip` | TEXT | NOT NULL | | LAN broadcast IP for Wake-on-LAN magic packets. |
| `wol_port` | INTEGER | NOT NULL | `9` | WoL UDP port. |
| `expected_spec_yaml` | TEXT | NOT NULL | | YAML describing expected hardware (CPU, memory, disks, firmware). |
| `pdu_config_json` | TEXT | | | PDU power control config (future use). |
| `ipmi_config_json` | TEXT | | | IPMI config (future use). |
| `notes` | TEXT | NOT NULL | `''` | Operator notes. |
| `created_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
| `updated_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
| `last_seen_at` | TIMESTAMP | | | Host-mode agent heartbeat timestamp. NULL = never seen. |
### `runs`
Vetting run instances. Each run belongs to one host and walks through
the state machine.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `host_id` | INTEGER | NOT NULL FK → hosts(id) CASCADE | | |
| `state` | TEXT | NOT NULL | | Current `RunState` (see `internal/model`). |
| `result` | TEXT | | | `pass` or `fail` once terminal. |
| `failed_stage` | TEXT | | | Stage name that halted the pipeline. |
| `next_boot_target` | TEXT | | | `linux`, `memtest`, etc. (future use). |
| `agent_token_hash` | TEXT | NOT NULL | | SHA-256 hash of the bearer token. |
| `started_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
| `completed_at` | TIMESTAMP | | | Set when run reaches a terminal state. |
| `report_path` | TEXT | | | Path to `report.json` on disk. |
| `hold_ip` | TEXT | | | Agent IP during FailedHolding (for SSH command). |
| `override_flags_json` | TEXT | | | JSON blob (e.g. `{"wipe": true}`). |
| `non_destructive` | INTEGER | NOT NULL | `0` | `1` = skip badblocks + wipe probe. |
| `profile` | TEXT | NOT NULL | `'quick'` | `quick`, `deep`, or `soak`. |
**Indices:**
- `idx_runs_host` on `(host_id)`
- `idx_runs_state` on `(state)`
### `stages`
Per-stage results within a run. Seeded at `/claim` time with one row
per stage in `DefaultStageOrder`.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `name` | TEXT | NOT NULL | | Stage name (e.g. `SMART`, `CPUStress`). |
| `ordinal` | INTEGER | NOT NULL | | 0-based position in the pipeline. |
| `state` | TEXT | NOT NULL | | `pending`, `running`, `passed`, `failed`, `skipped`. |
| `started_at` | TIMESTAMP | | | Set when the stage begins. |
| `completed_at` | TIMESTAMP | | | Set when the stage finishes. |
| `summary_json` | TEXT | | | Arbitrary JSON from the agent's result. |
**Indices:**
- `idx_stages_run_ordinal` on `(run_id, ordinal)`
### `sub_steps`
Finer-grained units within a stage (per-disk SMART, per-NIC iperf,
CPU/memory pass, per-GPU run). Not every stage has sub-steps.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `stage_name` | TEXT | NOT NULL | | Parent stage name. |
| `ordinal` | INTEGER | NOT NULL | | 0-based within `(run_id, stage_name)`. |
| `name` | TEXT | NOT NULL | | Human label (e.g. `sda SMART`, `eth0 iperf`). |
| `state` | TEXT | NOT NULL | `'pending'` | `pending`, `running`, `passed`, `failed`, `skipped`. |
| `started_at` | TIMESTAMP | | | |
| `completed_at` | TIMESTAMP | | | |
| `summary_json` | TEXT | NOT NULL | `'{}'` | |
**Constraints:** `UNIQUE (run_id, stage_name, ordinal)`
**Indices:** `idx_sub_steps_run` on `(run_id, stage_name, ordinal)`
### `measurements`
Time-series sensor data from the thermal sidecar and stage executors.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `stage_id` | INTEGER | FK → stages(id) SET NULL | | Optional link to a specific stage. |
| `ts` | TIMESTAMP | NOT NULL | | Sample timestamp. |
| `kind` | TEXT | NOT NULL | | `temp`, `power`, `iperf`, `fio`, `smart_attr`, `psu_volt`, `fan`, etc. |
| `key` | TEXT | NOT NULL | | Source identifier (e.g. `cpu/0`, `+12V`). |
| `value` | REAL | | | Numeric sample. |
| `unit` | TEXT | | | Display unit. |
**Indices:** `idx_measurements_run_kind_ts` on `(run_id, kind, ts)`
### `artifacts`
On-disk file references (reports, fio logs, iperf logs, hold keys).
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `stage_id` | INTEGER | FK → stages(id) SET NULL | | |
| `kind` | TEXT | NOT NULL | | `inventory`, `report`, `report_html`, `hold_key`, `fio`, `iperf`. |
| `path` | TEXT | NOT NULL | | Absolute path on disk. |
| `sha256` | TEXT | NOT NULL | | SHA-256 hex digest. |
| `size_bytes` | INTEGER | NOT NULL | | File size. |
### `spec_diffs`
Expected-vs-actual hardware divergences from SpecValidate.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `field` | TEXT | NOT NULL | | Dotted path (e.g. `memory.total_gib`, `cpu.logical_cores`). |
| `expected` | TEXT | | | Expected value from the host's spec YAML. |
| `actual` | TEXT | | | Observed value from the inventory probe. |
| `severity` | TEXT | NOT NULL | | `critical`, `warning`, `info`. |
| `ignored` | INTEGER | NOT NULL | `0` | `1` = operator chose to ignore this diff. |
### `thresholds`
Per-run threshold rules, seeded from the `ProfileRegistry` + per-host
overrides at run creation. Immutable for the run's lifetime.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `stage_name` | TEXT | NOT NULL | | `*` matches any stage. |
| `kind` | TEXT | NOT NULL | | Measurement kind to match. |
| `key` | TEXT | NOT NULL | | Key selector (glob-ish). |
| `op` | TEXT | NOT NULL | | `lt`, `lte`, `gt`, `gte`, `within_pct`. |
| `threshold` | REAL | NOT NULL | | Limit value. |
| `nominal` | REAL | NOT NULL | `0` | Reference for `within_pct`. |
| `unit` | TEXT | NOT NULL | `''` | Display unit. |
| `severity` | TEXT | NOT NULL | | `critical` or `warning`. |
| `source` | TEXT | NOT NULL | | `profile` or `host_override`. |
**Indices:**
- `idx_thresholds_run` on `(run_id)`
- `idx_thresholds_kind` on `(run_id, stage_name, kind)`
### `threshold_evaluations`
Per-sample pass/fail results from threshold evaluation. Drives
report badges and pipeline verdict rendering.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `threshold_id` | INTEGER | NOT NULL FK → thresholds(id) CASCADE | | |
| `stage_name` | TEXT | NOT NULL | | Stage the sample belongs to. |
| `kind` | TEXT | NOT NULL | | Measurement kind. |
| `key` | TEXT | NOT NULL | | Source key. |
| `ts` | TIMESTAMP | NOT NULL | | Sample timestamp. |
| `observed` | REAL | NOT NULL | | Observed value. |
| `passed` | INTEGER | NOT NULL | | `1` = within threshold, `0` = breach. |
**Indices:** `idx_threshold_evals_run` on `(run_id, passed)`
### `firmware_snapshots`
Per-run firmware version captures (BIOS, BMC, NIC, HBA, microcode,
NVMe). Populated by the Firmware stage; consumed by SpecValidate for
firmware version diffing.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
| `component` | TEXT | NOT NULL | | `bios`, `bmc`, `nic`, `hba`, `microcode`, `nvme_fw`. |
| `identifier` | TEXT | NOT NULL | | Slot, serial, or device path distinguishing this component. |
| `version` | TEXT | NOT NULL | | Firmware version string. |
| `vendor` | TEXT | NOT NULL | `''` | |
| `raw_json` | TEXT | NOT NULL | `'{}'` | Additional metadata. |
**Indices:** `idx_firmware_run` on `(run_id, component)`
### `events`
Event log table. Reserved for future use.
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| `id` | INTEGER | PK AUTOINCREMENT | | |
| `run_id` | INTEGER | FK → runs(id) CASCADE | | |
| `host_id` | INTEGER | FK → hosts(id) CASCADE | | |
| `ts` | TIMESTAMP | NOT NULL | | |
| `level` | TEXT | NOT NULL | | |
| `kind` | TEXT | NOT NULL | | |
| `message` | TEXT | NOT NULL | | |
| `data_json` | TEXT | | | |
### `settings`
Key-value store for orchestrator-level settings.
| Column | Type | Constraints | Description |
|--------|------|-------------|-------------|
| `key` | TEXT | PK | |
| `value` | TEXT | NOT NULL | |
---
## Entity relationships
```
hosts 1───N runs
├──N stages
│ └──(FK) measurements (stage_id, SET NULL)
│ └──(FK) artifacts (stage_id, SET NULL)
├──N sub_steps
├──N measurements (run_id)
├──N artifacts (run_id)
├──N spec_diffs
├──N thresholds
│ └──N threshold_evaluations
└──N firmware_snapshots
```
All foreign keys use `ON DELETE CASCADE` (except `stage_id` references
which use `SET NULL`). Deleting a host cascades through its runs and
all dependent rows.
## Data retention
The janitor goroutine prunes **on-disk files** (artifacts, logs) based
on `artifacts.retention_days` and `logs.retention_days`. **Database
rows are never deleted** by the janitor — run histories, measurement
time-series, spec diffs, and threshold evaluations survive cleanups
indefinitely.
See [architecture.md § Data retention](architecture.md#data-retention)
and [configuration.md § janitor](configuration.md#janitor).
## Migration history
| File | What it adds |
|------|-------------|
| `0001_init.sql` | Core schema: `hosts`, `runs`, `stages`, `measurements`, `artifacts`, `spec_diffs`, `events`, `settings`. |
| `0002_add_hosts_last_seen_at.sql` | `hosts.last_seen_at` column for host-mode agent heartbeats. |
| `0003_add_runs_non_destructive.sql` | `runs.non_destructive` boolean flag. |
| `0004_add_sub_steps.sql` | `sub_steps` table for per-disk/per-NIC granular stage detail. |
| `0005_profiles_thresholds_firmware.sql` | `runs.profile` column, `thresholds` + `threshold_evaluations` tables, `firmware_snapshots` table. |
All migrations are additive — no schema deletions or renames.