docs: comprehensive documentation expansion
Add 4 new doc files (configuration reference, development guide, API reference with full request/response schemas, database schema), expand the README with a feature list and how-it-works walkthrough, fix missing Firmware and Burn stages in architecture.md and test-suite.md, add threshold engine and host-mode agent sections, and add godoc comments to 11 packages and 6 model types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,279 @@
|
||||
# Database schema
|
||||
|
||||
The orchestrator uses SQLite via
|
||||
[modernc.org/sqlite](https://pkg.go.dev/modernc.org/sqlite) — a pure
|
||||
Go driver with no cgo dependency. The database file is created on
|
||||
first startup at the path in `database.path`
|
||||
(default `./var/vetting.db`).
|
||||
|
||||
**Pragmas set at open time:**
|
||||
|
||||
- `PRAGMA journal_mode = WAL` — write-ahead logging for concurrent
|
||||
readers.
|
||||
- `PRAGMA foreign_keys = ON` — enforced referential integrity.
|
||||
|
||||
**Migrations** are embedded via `go:embed` in `internal/db/` and
|
||||
applied in filename order at startup. A `schema_migrations` table
|
||||
tracks which migrations have run.
|
||||
|
||||
---
|
||||
|
||||
## Tables
|
||||
|
||||
### `hosts`
|
||||
|
||||
Registered hardware nodes in the vetting cluster.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `name` | TEXT | NOT NULL UNIQUE | | Human-readable host name. |
|
||||
| `mac` | TEXT | NOT NULL UNIQUE | | Lowercase colon form (e.g. `aa:bb:cc:dd:ee:ff`). |
|
||||
| `wol_broadcast_ip` | TEXT | NOT NULL | | LAN broadcast IP for Wake-on-LAN magic packets. |
|
||||
| `wol_port` | INTEGER | NOT NULL | `9` | WoL UDP port. |
|
||||
| `expected_spec_yaml` | TEXT | NOT NULL | | YAML describing expected hardware (CPU, memory, disks, firmware). |
|
||||
| `pdu_config_json` | TEXT | | | PDU power control config (future use). |
|
||||
| `ipmi_config_json` | TEXT | | | IPMI config (future use). |
|
||||
| `notes` | TEXT | NOT NULL | `''` | Operator notes. |
|
||||
| `created_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
|
||||
| `updated_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
|
||||
| `last_seen_at` | TIMESTAMP | | | Host-mode agent heartbeat timestamp. NULL = never seen. |
|
||||
|
||||
### `runs`
|
||||
|
||||
Vetting run instances. Each run belongs to one host and walks through
|
||||
the state machine.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `host_id` | INTEGER | NOT NULL FK → hosts(id) CASCADE | | |
|
||||
| `state` | TEXT | NOT NULL | | Current `RunState` (see `internal/model`). |
|
||||
| `result` | TEXT | | | `pass` or `fail` once terminal. |
|
||||
| `failed_stage` | TEXT | | | Stage name that halted the pipeline. |
|
||||
| `next_boot_target` | TEXT | | | `linux`, `memtest`, etc. (future use). |
|
||||
| `agent_token_hash` | TEXT | NOT NULL | | SHA-256 hash of the bearer token. |
|
||||
| `started_at` | TIMESTAMP | NOT NULL | `CURRENT_TIMESTAMP` | |
|
||||
| `completed_at` | TIMESTAMP | | | Set when run reaches a terminal state. |
|
||||
| `report_path` | TEXT | | | Path to `report.json` on disk. |
|
||||
| `hold_ip` | TEXT | | | Agent IP during FailedHolding (for SSH command). |
|
||||
| `override_flags_json` | TEXT | | | JSON blob (e.g. `{"wipe": true}`). |
|
||||
| `non_destructive` | INTEGER | NOT NULL | `0` | `1` = skip badblocks + wipe probe. |
|
||||
| `profile` | TEXT | NOT NULL | `'quick'` | `quick`, `deep`, or `soak`. |
|
||||
|
||||
**Indices:**
|
||||
- `idx_runs_host` on `(host_id)`
|
||||
- `idx_runs_state` on `(state)`
|
||||
|
||||
### `stages`
|
||||
|
||||
Per-stage results within a run. Seeded at `/claim` time with one row
|
||||
per stage in `DefaultStageOrder`.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `name` | TEXT | NOT NULL | | Stage name (e.g. `SMART`, `CPUStress`). |
|
||||
| `ordinal` | INTEGER | NOT NULL | | 0-based position in the pipeline. |
|
||||
| `state` | TEXT | NOT NULL | | `pending`, `running`, `passed`, `failed`, `skipped`. |
|
||||
| `started_at` | TIMESTAMP | | | Set when the stage begins. |
|
||||
| `completed_at` | TIMESTAMP | | | Set when the stage finishes. |
|
||||
| `summary_json` | TEXT | | | Arbitrary JSON from the agent's result. |
|
||||
|
||||
**Indices:**
|
||||
- `idx_stages_run_ordinal` on `(run_id, ordinal)`
|
||||
|
||||
### `sub_steps`
|
||||
|
||||
Finer-grained units within a stage (per-disk SMART, per-NIC iperf,
|
||||
CPU/memory pass, per-GPU run). Not every stage has sub-steps.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `stage_name` | TEXT | NOT NULL | | Parent stage name. |
|
||||
| `ordinal` | INTEGER | NOT NULL | | 0-based within `(run_id, stage_name)`. |
|
||||
| `name` | TEXT | NOT NULL | | Human label (e.g. `sda SMART`, `eth0 iperf`). |
|
||||
| `state` | TEXT | NOT NULL | `'pending'` | `pending`, `running`, `passed`, `failed`, `skipped`. |
|
||||
| `started_at` | TIMESTAMP | | | |
|
||||
| `completed_at` | TIMESTAMP | | | |
|
||||
| `summary_json` | TEXT | NOT NULL | `'{}'` | |
|
||||
|
||||
**Constraints:** `UNIQUE (run_id, stage_name, ordinal)`
|
||||
**Indices:** `idx_sub_steps_run` on `(run_id, stage_name, ordinal)`
|
||||
|
||||
### `measurements`
|
||||
|
||||
Time-series sensor data from the thermal sidecar and stage executors.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `stage_id` | INTEGER | FK → stages(id) SET NULL | | Optional link to a specific stage. |
|
||||
| `ts` | TIMESTAMP | NOT NULL | | Sample timestamp. |
|
||||
| `kind` | TEXT | NOT NULL | | `temp`, `power`, `iperf`, `fio`, `smart_attr`, `psu_volt`, `fan`, etc. |
|
||||
| `key` | TEXT | NOT NULL | | Source identifier (e.g. `cpu/0`, `+12V`). |
|
||||
| `value` | REAL | | | Numeric sample. |
|
||||
| `unit` | TEXT | | | Display unit. |
|
||||
|
||||
**Indices:** `idx_measurements_run_kind_ts` on `(run_id, kind, ts)`
|
||||
|
||||
### `artifacts`
|
||||
|
||||
On-disk file references (reports, fio logs, iperf logs, hold keys).
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `stage_id` | INTEGER | FK → stages(id) SET NULL | | |
|
||||
| `kind` | TEXT | NOT NULL | | `inventory`, `report`, `report_html`, `hold_key`, `fio`, `iperf`. |
|
||||
| `path` | TEXT | NOT NULL | | Absolute path on disk. |
|
||||
| `sha256` | TEXT | NOT NULL | | SHA-256 hex digest. |
|
||||
| `size_bytes` | INTEGER | NOT NULL | | File size. |
|
||||
|
||||
### `spec_diffs`
|
||||
|
||||
Expected-vs-actual hardware divergences from SpecValidate.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `field` | TEXT | NOT NULL | | Dotted path (e.g. `memory.total_gib`, `cpu.logical_cores`). |
|
||||
| `expected` | TEXT | | | Expected value from the host's spec YAML. |
|
||||
| `actual` | TEXT | | | Observed value from the inventory probe. |
|
||||
| `severity` | TEXT | NOT NULL | | `critical`, `warning`, `info`. |
|
||||
| `ignored` | INTEGER | NOT NULL | `0` | `1` = operator chose to ignore this diff. |
|
||||
|
||||
### `thresholds`
|
||||
|
||||
Per-run threshold rules, seeded from the `ProfileRegistry` + per-host
|
||||
overrides at run creation. Immutable for the run's lifetime.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `stage_name` | TEXT | NOT NULL | | `*` matches any stage. |
|
||||
| `kind` | TEXT | NOT NULL | | Measurement kind to match. |
|
||||
| `key` | TEXT | NOT NULL | | Key selector (glob-ish). |
|
||||
| `op` | TEXT | NOT NULL | | `lt`, `lte`, `gt`, `gte`, `within_pct`. |
|
||||
| `threshold` | REAL | NOT NULL | | Limit value. |
|
||||
| `nominal` | REAL | NOT NULL | `0` | Reference for `within_pct`. |
|
||||
| `unit` | TEXT | NOT NULL | `''` | Display unit. |
|
||||
| `severity` | TEXT | NOT NULL | | `critical` or `warning`. |
|
||||
| `source` | TEXT | NOT NULL | | `profile` or `host_override`. |
|
||||
|
||||
**Indices:**
|
||||
- `idx_thresholds_run` on `(run_id)`
|
||||
- `idx_thresholds_kind` on `(run_id, stage_name, kind)`
|
||||
|
||||
### `threshold_evaluations`
|
||||
|
||||
Per-sample pass/fail results from threshold evaluation. Drives
|
||||
report badges and pipeline verdict rendering.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `threshold_id` | INTEGER | NOT NULL FK → thresholds(id) CASCADE | | |
|
||||
| `stage_name` | TEXT | NOT NULL | | Stage the sample belongs to. |
|
||||
| `kind` | TEXT | NOT NULL | | Measurement kind. |
|
||||
| `key` | TEXT | NOT NULL | | Source key. |
|
||||
| `ts` | TIMESTAMP | NOT NULL | | Sample timestamp. |
|
||||
| `observed` | REAL | NOT NULL | | Observed value. |
|
||||
| `passed` | INTEGER | NOT NULL | | `1` = within threshold, `0` = breach. |
|
||||
|
||||
**Indices:** `idx_threshold_evals_run` on `(run_id, passed)`
|
||||
|
||||
### `firmware_snapshots`
|
||||
|
||||
Per-run firmware version captures (BIOS, BMC, NIC, HBA, microcode,
|
||||
NVMe). Populated by the Firmware stage; consumed by SpecValidate for
|
||||
firmware version diffing.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | NOT NULL FK → runs(id) CASCADE | | |
|
||||
| `component` | TEXT | NOT NULL | | `bios`, `bmc`, `nic`, `hba`, `microcode`, `nvme_fw`. |
|
||||
| `identifier` | TEXT | NOT NULL | | Slot, serial, or device path distinguishing this component. |
|
||||
| `version` | TEXT | NOT NULL | | Firmware version string. |
|
||||
| `vendor` | TEXT | NOT NULL | `''` | |
|
||||
| `raw_json` | TEXT | NOT NULL | `'{}'` | Additional metadata. |
|
||||
|
||||
**Indices:** `idx_firmware_run` on `(run_id, component)`
|
||||
|
||||
### `events`
|
||||
|
||||
Event log table. Reserved for future use.
|
||||
|
||||
| Column | Type | Constraints | Default | Description |
|
||||
|--------|------|-------------|---------|-------------|
|
||||
| `id` | INTEGER | PK AUTOINCREMENT | | |
|
||||
| `run_id` | INTEGER | FK → runs(id) CASCADE | | |
|
||||
| `host_id` | INTEGER | FK → hosts(id) CASCADE | | |
|
||||
| `ts` | TIMESTAMP | NOT NULL | | |
|
||||
| `level` | TEXT | NOT NULL | | |
|
||||
| `kind` | TEXT | NOT NULL | | |
|
||||
| `message` | TEXT | NOT NULL | | |
|
||||
| `data_json` | TEXT | | | |
|
||||
|
||||
### `settings`
|
||||
|
||||
Key-value store for orchestrator-level settings.
|
||||
|
||||
| Column | Type | Constraints | Description |
|
||||
|--------|------|-------------|-------------|
|
||||
| `key` | TEXT | PK | |
|
||||
| `value` | TEXT | NOT NULL | |
|
||||
|
||||
---
|
||||
|
||||
## Entity relationships
|
||||
|
||||
```
|
||||
hosts 1───N runs
|
||||
├──N stages
|
||||
│ └──(FK) measurements (stage_id, SET NULL)
|
||||
│ └──(FK) artifacts (stage_id, SET NULL)
|
||||
├──N sub_steps
|
||||
├──N measurements (run_id)
|
||||
├──N artifacts (run_id)
|
||||
├──N spec_diffs
|
||||
├──N thresholds
|
||||
│ └──N threshold_evaluations
|
||||
└──N firmware_snapshots
|
||||
```
|
||||
|
||||
All foreign keys use `ON DELETE CASCADE` (except `stage_id` references
|
||||
which use `SET NULL`). Deleting a host cascades through its runs and
|
||||
all dependent rows.
|
||||
|
||||
## Data retention
|
||||
|
||||
The janitor goroutine prunes **on-disk files** (artifacts, logs) based
|
||||
on `artifacts.retention_days` and `logs.retention_days`. **Database
|
||||
rows are never deleted** by the janitor — run histories, measurement
|
||||
time-series, spec diffs, and threshold evaluations survive cleanups
|
||||
indefinitely.
|
||||
|
||||
See [architecture.md § Data retention](architecture.md#data-retention)
|
||||
and [configuration.md § janitor](configuration.md#janitor).
|
||||
|
||||
## Migration history
|
||||
|
||||
| File | What it adds |
|
||||
|------|-------------|
|
||||
| `0001_init.sql` | Core schema: `hosts`, `runs`, `stages`, `measurements`, `artifacts`, `spec_diffs`, `events`, `settings`. |
|
||||
| `0002_add_hosts_last_seen_at.sql` | `hosts.last_seen_at` column for host-mode agent heartbeats. |
|
||||
| `0003_add_runs_non_destructive.sql` | `runs.non_destructive` boolean flag. |
|
||||
| `0004_add_sub_steps.sql` | `sub_steps` table for per-disk/per-NIC granular stage detail. |
|
||||
| `0005_profiles_thresholds_firmware.sql` | `runs.profile` column, `thresholds` + `threshold_evaluations` tables, `firmware_snapshots` table. |
|
||||
|
||||
All migrations are additive — no schema deletions or renames.
|
||||
Reference in New Issue
Block a user