Post-repair hardware validation pipeline for Proxmox cluster hosts. Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
@@ -0,0 +1,178 @@
# Architecture

A single Go binary runs the orchestrator. A second Go binary runs
inside a custom Debian live image (built with mkosi) and becomes the
per-run test agent. The two talk over HTTP + SSE.

```
  Operator browser (HTMX + SSE, admin login)
        │ HTTPS
        ▼
┌───────────────────────────────────────────────────────────────┐
│ Orchestrator LXC — single Go binary `vetting`                 │
│                                                               │
│   UI (Templ) ─┬─ Agent API ─┬─ SSE hub                        │
│               │             │                                 │
│   Orchestrator core (state machine, dispatcher sem=3,         │
│     stage executors, WoL sender, token issuer)                │
│               │                                               │
│         ┌─────┴─────┬──────────┐                              │
│         ▼           ▼          ▼                              │
│      SQLite  flat-file logs    dnsmasq subprocess             │
│                                (DHCP+TFTP+HTTP, MAC allowlist)│
│                                                               │
│   Janitor goroutine (retention-based cleanup)                 │
│   Notifier registry (ntfy/discord/smtp)                       │
└─────────────────────────────────────────┬─────────────────────┘
                                          │ LAN
                                          ▼
                          Host under test (×2–3)
                            PXE → iPXE → Linux live image
                              └─ vetting-agent (HTTP+SSE back)
```

## Packages

| Package | Purpose |
|---|---|
| `cmd/vetting` | Orchestrator entrypoint. Wires config, stores, runner, dispatcher, iperf supervisor, PXE supervisor, janitor, HTTP router. |
| `cmd/vetting-agent` | In-image agent entrypoint. Reads kernel cmdline params, starts the agent loop. |
| `internal/config` | YAML loader + types. |
| `internal/db` | SQLite open + embedded migrations. Pure Go via modernc.org/sqlite. |
| `internal/model` | Plain structs: `Host`, `Run`, `Stage`, `Measurement`, `SpecDiff`, `Artifact`. |
| `internal/store` | Repository layer; SQL is hand-written. |
| `internal/orchestrator` | State machine, dispatcher, per-run runner, WoL sender, HMAC run tokens, iperf supervisor. |
| `internal/api` | HTTP handlers: `agent_handlers.go` (the agent-facing API) and `ui_handlers.go` (HTMX fragments + SSE). |
| `internal/httpserver` | chi router assembly — lives here to avoid `api ↔ orchestrator` cyclic imports. |
| `internal/web` | Embedded static assets + compiled Templ templates. |
| `internal/auth` | Single-admin bcrypt + signed-cookie sessions. |
| `internal/pxe` | dnsmasq subprocess supervisor + per-MAC iPXE script generator. |
| `internal/events` | In-process SSE hub (fan-out to live browser clients). |
| `internal/logs` | Per-run flat-file writer + SSE fan-out of live log tail. |
| `internal/spec` | Expected-vs-actual diff engine with severity classification. |
| `internal/notify` | Pluggable notifier registry (ntfy, Discord webhook, SMTP). |
| `internal/report` | HTML + JSON report generation (html/template, self-contained). |
| `internal/hold` | Per-run SSH key issuance for `FailedHolding`. |
| `internal/janitor` | Retention-based cleanup of old artifact files + log files. |
| `agent/` | In-image agent: claim loop, stage dispatch, heartbeat, log forwarder, thermal sidecar. |
| `agent/probes` | lshw, dmidecode, smartctl, lspci, hwmon, nvidia-smi wrappers. |
| `agent/tests` | Per-stage test implementations (SMART, CPUStress, Storage, Network, GPU, PSU). |
| `live-image/` | mkosi config + postinst for the Debian live image. |
| `deploy/` | systemd unit + example config + install.sh. |
| `test/e2e/` | Build-tagged (`-tags=e2e`) QEMU + PXE full-stack test. |

## State machine

Per-run state is the single source of truth; the UI is a pure
projection of DB + event stream.

```
Registered → Queued → WaitingWoL → Booting → InventoryCheck
           → SpecValidate → SMART → CPUStress → Storage → Network
           → GPU → PSU → Reporting → Completed

any stage → Failed → FailedHolding → Released
```

Key points:

- **Transitions are table-driven** (`internal/orchestrator/statemachine.go`).
  Each `(state, event) → (next, action)` is encoded once.
- **Orchestrator-owned stages resolve inside `/result`:** `SpecValidate`
  and `Reporting` flip state forward as part of the preceding stage's
  result handler, so the agent never sees them as "its turn".
- **Stage rows persist before SSE fan-out** — the UI can re-derive
  state by reading SQLite, and an SSE reconnect mid-run just fetches
  fresh tile fragments.
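
To make the table-driven point concrete, here is a minimal sketch of a `(state, event) → (next, action)` lookup. The state names come from the diagram above; the event names, the action labels, and the `next` helper are assumptions made for illustration and are not taken from `statemachine.go`.

```go
package main

import "fmt"

// State and Event are illustrative types; the real ones may differ.
type State string
type Event string

const (
	Queued        State = "Queued"
	WaitingWoL    State = "WaitingWoL"
	Booting       State = "Booting"
	Failed        State = "Failed"
	FailedHolding State = "FailedHolding"
)

// Event names are hypothetical.
const (
	EvDispatched Event = "dispatched"
	EvWoLSent    Event = "wol_sent"
	EvStageFail  Event = "stage_failed"
	EvHoldOpened Event = "hold_opened"
)

type key struct {
	from State
	on   Event
}

type transition struct {
	Next   State
	Action string // label for the side effect to run; illustrative only
}

// table encodes each (state, event) pair exactly once.
var table = map[key]transition{
	{Queued, EvDispatched}:  {WaitingWoL, "send_wol"},
	{WaitingWoL, EvWoLSent}: {Booting, "await_hello"},
	{Booting, EvStageFail}:  {Failed, "notify"},
	{Failed, EvHoldOpened}:  {FailedHolding, "issue_hold_key"},
}

// next looks up a transition; pairs missing from the table are rejected.
func next(s State, e Event) (transition, error) {
	t, ok := table[key{s, e}]
	if !ok {
		return transition{}, fmt.Errorf("no transition from %s on %s", s, e)
	}
	return t, nil
}

func main() {
	t, err := next(Queued, EvDispatched)
	fmt.Println(t.Next, t.Action, err)
	_, err = next(Booting, EvDispatched)
	fmt.Println(err)
}
```

Keeping every legal move in one map means an unexpected event fails loudly instead of silently advancing a run.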

## Agent ↔ orchestrator protocol

```
GET  /ipxe/{MAC}                  → per-MAC iPXE script
POST /api/v1/runs/{id}/hello      → "I booted; here's my address"
POST /api/v1/runs/{id}/claim      → validate token, receive stage list
POST /api/v1/runs/{id}/heartbeat  → liveness ping; response carries cmd
POST /api/v1/runs/{id}/log        → batch of log lines
POST /api/v1/runs/{id}/sensor     → batch of measurements (thermals, throughput)
POST /api/v1/runs/{id}/result     → stage result; response says next_state
POST /api/v1/runs/{id}/hold       → on FailedHolding, receive authorized_key
```

Auth on every `/api/v1/*` call: the agent presents a bearer token,
which is stored as a bcrypt hash in `runs.agent_token_hash` and
compared in constant time. The plaintext token is delivered only in
the kernel cmdline of the per-MAC iPXE script, so a host can obtain it
only if it sits on the trusted bridge and its MAC is already in the
dnsmasq allowlist.
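
A minimal sketch of issuing and checking a per-run HMAC token follows. It assumes the token is the hex-encoded HMAC-SHA256 of the run ID under a server-side secret; the actual token layout, the function names, and the secret handling are not given in this document and are hypothetical. The bcrypt storage step is left out so the sketch stays within the standard library.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// issueToken derives a per-run token from a server-side secret.
// Assumption: the MAC covers only the decimal run ID.
func issueToken(secret []byte, runID int64) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(strconv.FormatInt(runID, 10)))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyToken recomputes the MAC and compares it in constant time.
func verifyToken(secret []byte, runID int64, presented string) bool {
	got, err := hex.DecodeString(presented)
	if err != nil {
		return false // not hex at all
	}
	want, _ := hex.DecodeString(issueToken(secret, runID))
	return hmac.Equal(got, want)
}

func main() {
	secret := []byte("demo-secret-not-for-production")
	tok := issueToken(secret, 42)
	fmt.Println(len(tok), verifyToken(secret, 42, tok), verifyToken(secret, 43, tok))
}
```

`hmac.Equal` is used instead of `==` so that the comparison time does not depend on how many leading bytes match.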

### Heartbeat control channel

The heartbeat response carries a `cmd` field the agent acts on:

| cmd | When fired | Agent action |
|---|---|---|
| `continue` | Normal case | No-op; keep running current stage |
| `shutdown` | Run reached `Completed` | `systemctl poweroff` |
| `abort` | Run in `FailedHolding` or `Released` | Stop heartbeat loop; let the operator drive |
| `retry_stage` | Operator pressed "Override wipe-probe" | Re-enter the named stage with `override_flags` armed |
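
The table can be read as a small dispatch function on the agent side. The sketch below assumes a JSON response body; only the `cmd` values come from the table, while the `stage` and `override_flags` field names, the `decide` helper, and the action labels are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// heartbeatResponse models only what this sketch needs. The JSON names
// other than "cmd" are assumptions.
type heartbeatResponse struct {
	Cmd           string   `json:"cmd"`
	Stage         string   `json:"stage,omitempty"`
	OverrideFlags []string `json:"override_flags,omitempty"`
}

// decide maps a heartbeat response body to the action the agent should take.
func decide(body []byte) (action string, stage string, err error) {
	var r heartbeatResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", "", err
	}
	switch r.Cmd {
	case "continue":
		return "keep-running", "", nil
	case "shutdown":
		return "poweroff", "", nil
	case "abort":
		return "stop-heartbeat", "", nil
	case "retry_stage":
		if r.Stage == "" {
			return "", "", fmt.Errorf("retry_stage without a stage name")
		}
		return "re-enter", r.Stage, nil
	default:
		// Unknown commands are surfaced rather than treated as "continue".
		return "", "", fmt.Errorf("unknown cmd %q", r.Cmd)
	}
}

func main() {
	action, stage, err := decide([]byte(`{"cmd":"retry_stage","stage":"Storage"}`))
	fmt.Println(action, stage, err)
}
```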

## Safety: destructive disk tests

Four layered gates:

1. **MAC allowlist** — dnsmasq only answers DHCP for registered MACs.
2. **Signed run token** — orchestrator issues a per-run HMAC token in
   the iPXE kernel cmdline; the agent submits it on `/claim` and the
   orchestrator verifies before handing back the stage list.
3. **Wipe probe** — before `badblocks`, the agent scans for filesystem
   signatures / LVM metadata / partition tables. Anything found →
   `FailedHolding` on `Storage`. The operator explicitly clicks
   **Override wipe-probe** to proceed.
4. **Device allowlist** — the agent only targets block devices matching
   the inventory's `expected_disks`. USB sticks and surprise disks are
   skipped.
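
The device allowlist (gate 4) can be sketched as a filter over discovered disks. This assumes disks are matched by serial number and that the agent knows whether a device is USB-attached; the document does not say how `expected_disks` entries are matched, so the `blockDevice` type and `selectTargets` helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// blockDevice is a hypothetical view of one discovered disk.
type blockDevice struct {
	Path   string // e.g. /dev/sda
	Serial string
	USB    bool
}

// selectTargets returns only the devices whose serial is expected.
// Anything else (USB sticks, surprise disks) is never returned.
func selectTargets(found []blockDevice, expectedSerials []string) []string {
	allowed := make(map[string]bool, len(expectedSerials))
	for _, s := range expectedSerials {
		allowed[s] = true
	}
	var targets []string
	for _, d := range found {
		if d.USB || !allowed[d.Serial] {
			continue // removable or unexpected: skipped
		}
		targets = append(targets, d.Path)
	}
	sort.Strings(targets) // stable order for logs and tests
	return targets
}

func main() {
	found := []blockDevice{
		{Path: "/dev/sda", Serial: "S1"},
		{Path: "/dev/sdb", Serial: "S2", USB: true},
		{Path: "/dev/sdc", Serial: "S9"},
	}
	fmt.Println(selectTargets(found, []string{"S1", "S2"}))
}
```

The filter is default-deny: a disk has to be positively matched before any destructive test may touch it.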

## Notifications

Fire-and-forget. The orchestrator fires four event kinds:

| Kind | Severity | When |
|---|---|---|
| `StageFailed` | critical | Any stage returns `passed=false` |
| `SpecMismatch` | critical | `SpecValidate` finds critical diffs |
| `HoldingOpened` | critical | Agent POSTs `/hold` (operator can SSH in) |
| `RunCompleted` | info | Pipeline reaches `Completed` |

The config maps event kinds and severities to one or more notifiers
(ntfy, Discord webhook, SMTP). Each notifier gets one attempt per
event with a 10s timeout; delivery failures are logged, nothing is
persisted.

## Why a separate notify package?

Keeps the `/result` and `/hold` handlers non-blocking. Each dispatch
starts a goroutine per target; a slow ntfy server doesn't back up an
SMTP notifier or delay the HTTP response to the agent.
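
The goroutine-per-target dispatch might look roughly like the sketch below. The `Notifier` interface, the `dispatch` signature, and the `stubNotifier` test double are assumptions for illustration; only the one-attempt, 10s-timeout, log-on-failure behaviour comes from the text above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// defaultTimeout mirrors the 10s per-attempt limit described above.
const defaultTimeout = 10 * time.Second

// Notifier is a hypothetical interface for one delivery target.
type Notifier interface {
	Name() string
	Send(ctx context.Context, kind, msg string) error
}

// dispatch starts one goroutine per target and returns immediately.
// The WaitGroup is returned only so demos and tests can wait.
func dispatch(targets []Notifier, kind, msg string, timeout time.Duration,
	logf func(format string, args ...any)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for _, n := range targets {
		wg.Add(1)
		go func(n Notifier) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			// Single attempt; failures are logged and then dropped.
			if err := n.Send(ctx, kind, msg); err != nil {
				logf("notify: %s failed: %v", n.Name(), err)
			}
		}(n)
	}
	return wg
}

// stubNotifier stands in for ntfy/Discord/SMTP in this sketch.
type stubNotifier struct {
	name  string
	delay time.Duration
	fail  bool
}

func (s stubNotifier) Name() string { return s.name }

func (s stubNotifier) Send(ctx context.Context, kind, msg string) error {
	select {
	case <-time.After(s.delay):
		if s.fail {
			return errors.New("rejected")
		}
		return nil
	case <-ctx.Done():
		return ctx.Err() // timed out
	}
}

func main() {
	targets := []Notifier{
		stubNotifier{name: "ntfy", delay: 50 * time.Millisecond}, // slower than the timeout below
		stubNotifier{name: "smtp"},
	}
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }
	wg := dispatch(targets, "StageFailed", "run 12: Storage failed", 10*time.Millisecond, logf)
	wg.Wait() // an HTTP handler would not wait; done here so the output is visible
}
```

Because each target has its own goroutine and its own deadline, the slow stub times out on its own without delaying the fast one.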

## Data retention

The janitor goroutine (`internal/janitor`) runs a sweep every
`janitor.interval_minutes` (default 60) and deletes:

- artifact files older than `artifacts.retention_days`, plus their
  `artifacts` table rows
- log files older than `logs.retention_days`

`runs`, `hosts`, `stages`, `measurements`, `spec_diffs` rows are
**never** deleted by the janitor — host histories and aggregate
metrics survive cleanups.
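
As a rough illustration of one sweep, the sketch below removes files whose modification time is older than a retention window. It assumes age is judged by file mtime and covers only the on-disk part; deleting the matching `artifacts` rows is omitted. The `expired` and `sweep` helpers and the `sampleNow` value are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// sampleNow is a fixed reference time used only for examples.
var sampleNow = time.Date(2024, 3, 1, 0, 0, 0, 0, time.UTC)

// expired reports whether modTime falls before now minus retentionDays.
func expired(modTime, now time.Time, retentionDays int) bool {
	cutoff := now.AddDate(0, 0, -retentionDays)
	return modTime.Before(cutoff)
}

// sweep deletes expired regular files under root and returns how many
// were removed. Directories are left in place.
func sweep(root string, now time.Time, retentionDays int) (int, error) {
	removed := 0
	err := filepath.WalkDir(root, func(path string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if expired(info.ModTime(), now, retentionDays) {
			if err := os.Remove(path); err != nil {
				return err
			}
			removed++
		}
		return nil
	})
	return removed, err
}

func main() {
	dir, err := os.MkdirTemp("", "janitor-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	old := filepath.Join(dir, "run-1.log")
	if err := os.WriteFile(old, []byte("x"), 0o644); err != nil {
		panic(err)
	}
	stale := time.Now().AddDate(0, 0, -45) // pretend the file is 45 days old
	if err := os.Chtimes(old, stale, stale); err != nil {
		panic(err)
	}
	n, err := sweep(dir, time.Now(), 30)
	fmt.Println(n, err)
}
```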

## Reproducible builds

The orchestrator and agent are pure Go; `make orchestrator-linux`
cross-compiles to `linux-amd64` from Windows or macOS.

The live image requires Linux-side tooling (mkosi, debootstrap,
squashfs-tools), so `make live-image` fails loudly on Windows and
redirects you to `wsl make live-image`. Pinning the package sources to
snapshot.debian.org in `live-image/mkosi.conf` keeps the image
contents stable over time for a given git SHA.
@@ -0,0 +1,171 @@
# Operations

Operator-facing runbook for the vetting orchestrator. If you're looking
for the "what does the system do" overview, see
[architecture.md](architecture.md). For what each test stage actually
measures, see [test-suite.md](test-suite.md).

## Install (Proxmox LXC)

Target: a Debian/Ubuntu LXC on a Proxmox host in the cluster you're
vetting nodes for. The LXC must be on the same L2 segment as the
repaired nodes so DHCP and WoL work.

1. On your workstation, cross-build the binary:

   ```
   make orchestrator-linux
   ```

   This produces `bin/vetting-linux-amd64`.

2. Copy the repo tree (or just `bin/`, `deploy/`) into the LXC, then
   from inside the LXC:

   ```
   sudo ./deploy/install.sh
   ```

   The installer:
   - `apt install`s `dnsmasq`, `iperf3`, `ca-certificates`
   - creates the `vetting` system user (home = `/var/lib/vetting`)
   - installs the binary into `/usr/local/bin/vetting`
   - drops `vetting.example.yaml` into `/etc/vetting/vetting.yaml`
     (only if there's no existing config — existing configs are
     preserved)
   - drops `/etc/systemd/system/vetting.service`
   - disables the distro-default dnsmasq (the orchestrator supervises
     its own)

   The installer does **not** enable the service, because the default
   config has a placeholder bcrypt password that the binary refuses to
   start with.

3. Generate an admin password hash and a session secret, then edit
   `/etc/vetting/vetting.yaml`:

   ```
   ./bin/gen-admin-password 'your-password-here'   # prints a bcrypt hash
   openssl rand -hex 32                            # prints a 64-char hex string
   ```

   Required fields:
   - `auth.admin_password_bcrypt` — the bcrypt hash
   - `auth.session_secret_hex` — the 64-character hex string (32 random
     bytes)
   - `server.public_url` — the URL your browser hits the LXC on
     (e.g. `https://vetting.lan:8443`). This is used as the
     click-through link in notifications, so it must be the *external*
     URL, not the bind address.

4. (Optional) Configure notifiers in the same file — see the
   commented-out example block for ntfy / Discord / SMTP.

5. Enable and start:

   ```
   sudo systemctl enable --now vetting
   sudo journalctl -fu vetting
   ```

## First vetting run

Against a QEMU VM first, before you point it at real hardware:

1. On the Proxmox host (or wherever your LXC lives):

   ```
   sudo ip link add br-vetting type bridge
   sudo ip addr add 10.77.0.1/24 dev br-vetting
   sudo ip link set br-vetting up
   ```

2. In the UI at `https://<lxc>:8443`, log in and register a host:
   - Name: `qemu-test`
   - MAC: `52:54:00:12:34:56`
   - WoL broadcast IP: `10.77.0.255`
   - Expected spec: paste a minimal YAML like
     ```yaml
     memory: { total_gib: 4 }
     cpu: { logical_cores: 4 }
     ```

3. Click **Start Vetting**. The UI tile will sit at `Queued → WaitingWoL`.

4. Launch the QEMU VM on the bridge so it PXE-boots from dnsmasq:

   ```
   sudo qemu-system-x86_64 \
     -enable-kvm -cpu host -smp 4 -m 4096 \
     -netdev bridge,id=n0,br=br-vetting \
     -device virtio-net-pci,netdev=n0,mac=52:54:00:12:34:56 \
     -drive file=/tmp/test-disk.img,format=raw,if=virtio \
     -boot n -serial mon:stdio -display none
   ```

5. Watch the tile advance through stages. On success, the tile shows
   **View report** and the VM shuts down automatically.

For real repaired hardware: same flow, but register the node's actual
MAC + expected spec, and make sure the node's BIOS is set to PXE-boot
from the NIC that's on the `br-vetting` network.

## A failed run — SSH to the held host

When a stage fails, the pipeline halts at `FailedHolding` and the
agent installs an orchestrator-issued SSH key into the live-image's
`/root/.ssh/authorized_keys`. The UI tile surfaces the IP and the
exact `ssh` command.

The hold key is **per-run**. Once you're done:

1. Power the host off (`poweroff` from the SSH session).
2. In the UI, click **Override wipe-probe** only when the failure was
   at the `Storage` stage *and* you're sure the disks are expendable.
   Otherwise click **Start vetting** on a fresh run from the host
   dashboard after fixing the underlying issue.

## Log + artifact layout

```
/var/lib/vetting/
  vetting.db              # SQLite: hosts, runs, stages, artifacts, spec_diffs, measurements
  artifacts/
    run-<N>/
      report.html         # operator-facing summary
      report.json         # machine-readable summary
      inventory.json      # raw probe output
      fio-<disk>.log      # storage stage output
      iperf-<nic>.json    # network stage output
      hold-<N>.pub        # per-run SSH pubkey (only if held)
/var/log/vetting/
  run-<N>.log             # append-only per-run log tail
```

Retention is governed by the `artifacts.retention_days` and
`logs.retention_days` settings. DB rows (run history) are preserved
indefinitely; only on-disk files get pruned.

## Troubleshooting

| Symptom | First check |
|---|---|
| Service refuses to start with `auth.admin_password_bcrypt is the placeholder` | You didn't replace the bcrypt hash in the config. Run `gen-admin-password`. |
| PXE client gets no DHCP offer | `journalctl -u vetting` for dnsmasq errors; confirm the LXC has `CAP_NET_ADMIN` (the shipped systemd unit does); confirm the host MAC is actually registered (`sqlite3 /var/lib/vetting/vetting.db 'SELECT name, mac FROM hosts;'`). |
| Agent `/hello` never fires | Check the live image is actually loading the agent binary — SSH into the live env (use the hold key path), `systemctl status vetting-agent`. |
| Tile stuck on `Booting` | Most likely the live image booted but the agent can't reach the orchestrator. Verify `vetting.orchestrator=` in the kernel cmdline resolves from the host's network. |
| UI shows stale stage | Force a reload; the SSE reconnect is automatic but the browser keeps the last state on ephemeral network blips. |
| Notification didn't fire | `journalctl -u vetting \| grep notify:` — delivery is fire-and-forget and the failure reason is logged but not persisted. |

## Upgrading

1. `make orchestrator-linux` on your workstation.
2. `scp bin/vetting-linux-amd64 lxc:/tmp/vetting.new`
3. On the LXC:
   ```
   sudo systemctl stop vetting
   sudo install -m 0755 /tmp/vetting.new /usr/local/bin/vetting
   sudo systemctl start vetting
   ```

The DB migration runs at startup and is append-only — no manual schema
work unless a release's notes call it out.
@@ -0,0 +1,166 @@
# Test suite

What each stage measures, what "pass" means, and where the results
land. Stages run strictly in order. Any stage returning `passed=false`
halts the pipeline at `FailedHolding` — the operator decides whether
to fix, override, or abandon.

## Stage order

```
Inventory → SpecValidate → SMART → CPUStress → Storage
          → Network → GPU → PSU → Reporting
```

Stages marked *orchestrator-owned* resolve inside `/result` and never
show up as "the agent's turn".

---

## Inventory

**Owner:** agent.
**What it does:** `dmidecode`, `lscpu`, `lshw`, `lspci`, `smartctl -i`
over each block device, `nvidia-smi -q` if present. The raw output is
merged into a single JSON blob.
**Pass:** the probes run to completion; missing optional tools (e.g.
`nvidia-smi` on a GPU-less host) are tolerated.
**Artifacts:** `inventory.json` under `artifacts/run-<N>/`.

## SpecValidate *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for the
preceding Inventory stage).
**What it does:** diffs the submitted inventory against the host's
`expected_spec_yaml`. The diff engine classifies each field as
`critical`, `warning`, or `info`.
**Pass:** zero `critical` diffs.
**Fail mode:** fires a `SpecMismatch` notification; transitions run
to `Failed → FailedHolding`.
**Artifacts:** `spec_diffs` table rows (one per divergence).
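
A minimal sketch of the diff-and-classify step is shown below. It assumes both specs have already been flattened to `field → value` string maps; the severity policy table, the `bios.version` field, and the `diffSpec` / `passes` helpers are hypothetical and are not the `internal/spec` implementation.

```go
package main

import (
	"fmt"
	"sort"
)

type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// SpecDiff is one expected-vs-actual divergence.
type SpecDiff struct {
	Field, Expected, Actual string
	Severity                Severity
}

// severityFor is an assumed policy table; unlisted fields default to info.
var severityFor = map[string]Severity{
	"memory.total_gib":  Critical,
	"cpu.logical_cores": Critical,
	"bios.version":      Warning,
}

// diffSpec returns one SpecDiff per expected field whose actual value
// differs, in sorted field order so output is deterministic.
func diffSpec(expected, actual map[string]string) []SpecDiff {
	fields := make([]string, 0, len(expected))
	for f := range expected {
		fields = append(fields, f)
	}
	sort.Strings(fields)

	var diffs []SpecDiff
	for _, f := range fields {
		if actual[f] == expected[f] {
			continue
		}
		sev, ok := severityFor[f]
		if !ok {
			sev = Info
		}
		diffs = append(diffs, SpecDiff{Field: f, Expected: expected[f], Actual: actual[f], Severity: sev})
	}
	return diffs
}

// passes applies the stage's rule: zero critical diffs.
func passes(diffs []SpecDiff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"memory.total_gib": "4", "cpu.logical_cores": "4", "bios.version": "1.2"}
	actual := map[string]string{"memory.total_gib": "4", "cpu.logical_cores": "2", "bios.version": "1.3"}
	diffs := diffSpec(expected, actual)
	for _, d := range diffs {
		fmt.Printf("%s: expected %s, got %s (%s)\n", d.Field, d.Expected, d.Actual, d.Severity)
	}
	fmt.Println("pass:", passes(diffs))
}
```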

## SMART

**Owner:** agent.
**What it does:** `smartctl -a /dev/<disk>` for each disk in the
inventory's `expected_disks`. Parses reallocated-sector counts, pending
sectors, end-to-end error counters, overall-health attribute.
**Pass:** SMART overall-health is PASSED on every expected disk and
reallocated-sector count is below threshold.
**Artifacts:** `smart-<disk>.txt` raw output.

## CPUStress

**Owner:** agent.
**What it does:** runs `stress-ng --cpu N --vm M --vm-bytes 90% -t
120s` with `N = logical_cores` and `M ≈ logical_cores/2`. The `--vm`
flag is the **stand-in for Memtest86+**: it exercises the memory
subsystem under load and will fail if the RAM has latent faults that
surface under thermal + allocator pressure.
**Pass:** `stress-ng` exits 0 and thermal samples taken by the sidecar
stay below the configured per-host `max_temp_c`.
**Caveat:** weaker than a dedicated memtest pass; see
[architecture.md](architecture.md) for the reasoning (Memtest86+
can't be signalled back without IPMI serial).

## Storage

**Owner:** agent (destructive).
**What it does:**

1. **Wipe probe** — scans for filesystem signatures, LVM metadata,
   partition tables on the expected disks. Any hit → halt with
   `UnexpectedData`; operator must click **Override wipe-probe**.
2. `badblocks -svw` (destructive read/write) on each expected disk.
3. `fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G` on
   each disk; captures IOPS and p99 latency.

**Pass:** badblocks reports zero bad blocks; fio IOPS above a
per-class floor (configurable).
**Artifacts:** `fio-<disk>.json` per disk.
**Safety gate:** the wipe-probe + device allowlist are the third and
fourth of the four layered gates against wiping the wrong disk. See
[architecture.md § Safety](architecture.md#safety-destructive-disk-tests).

## Network

**Owner:** agent.
**What it does:** `iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J`
to measure throughput to the orchestrator. The orchestrator-side
`iperf3 -s` is supervised by `internal/orchestrator/iperf.go` and
binds to the configured `network.iperf_port`.
**Pass:** throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps
for 10GbE).
**Artifacts:** `iperf-<nic>.json`.
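
The pass check can be sketched as a small parser over the `-J` output. This assumes a TCP test, where iperf3 reports the receiver-side rate under `end.sum_received.bits_per_second`; the `throughputGbps` and `meetsFloor` helpers are hypothetical names, and which rate the agent actually reads is not stated here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfReport models only the part of the iperf3 JSON this sketch reads.
type iperfReport struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputGbps extracts the receiver-side throughput in Gbps.
func throughputGbps(raw []byte) (float64, error) {
	var r iperfReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, fmt.Errorf("parse iperf3 JSON: %w", err)
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

// meetsFloor applies the per-class floor from the pass rule.
func meetsFloor(gbps, floorGbps float64) bool {
	return gbps >= floorGbps
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9410000000}}}`)
	gbps, err := throughputGbps(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f Gbps, meets 10GbE floor: %v\n", gbps, meetsFloor(gbps, 9))
}
```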

## GPU

**Owner:** agent.
**What it does:** runs `nvidia-smi -q` and a short compute workload
(`gpu-burn` if present, else `nvidia-smi dmon` during a `stress-ng
--gpu` burst). Skipped cleanly when no GPU is present.
**Pass:** no ECC errors reported; temperature below threshold; compute
workload exits 0.

## PSU

**Owner:** agent.
**What it does:** reads `/sys/class/hwmon/*/power_average` and `in*_input`
during a synthetic load burst (CPU + disk + NIC simultaneously) to
look for voltage sag or wattage anomalies. Records the full envelope
as `measurements` rows with `kind=psu`.
**Pass:** no voltage dip below threshold across the load burst.
**Caveat:** only reports on what the BMC exposes via hwmon — servers
without exposed PSU telemetry pass trivially. Documented limitation.

## Reporting *(orchestrator-owned)*

**Owner:** orchestrator (resolves inline inside the `/result` for PSU).
**What it does:**

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders `report.html` via `internal/report` (html/template with
   inlined CSS; self-contained offline-viewable).
3. Writes `report.json` with the same data in machine-readable form.
4. Records both as `report_html` / `report_json` artifact rows.
5. Transitions run → `Completed`.
6. Fires `RunCompleted` notification.
7. The next agent heartbeat returns `cmd=shutdown`.

## Thermal sidecar

**Owner:** agent (always-on from `Booting` until the agent exits).
**What it does:** every 5 seconds, walks `/sys/class/hwmon/*` and
POSTs temperature samples as a batch to `/sensor`. Populates the
`measurements` table with `kind=thermal`.
**No pass/fail** on its own — stages that care about thermals read the
sidecar's data via `measurements`. A dead sensor just drops out of
the next batch.
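
One sampling pass over hwmon might look like the sketch below. It relies on the Linux hwmon convention that `temp*_input` files hold millidegrees Celsius; the `parseMilliC` and `readTemps` helpers are hypothetical, and the 5-second ticker and the POST to `/sensor` are omitted. Unreadable sensors are skipped, matching the "drops out of the next batch" behaviour.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMilliC converts the contents of a temp*_input file
// (millidegrees Celsius, usually newline-terminated) to degrees.
func parseMilliC(raw string) (float64, error) {
	n, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(n) / 1000, nil
}

// readTemps collects one batch of samples keyed by sensor path.
// Sensors that cannot be read or parsed are left out of the batch.
func readTemps(root string) map[string]float64 {
	batch := map[string]float64{}
	paths, _ := filepath.Glob(filepath.Join(root, "hwmon*", "temp*_input"))
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // dead sensor: skip it
		}
		c, err := parseMilliC(string(b))
		if err != nil {
			continue
		}
		batch[p] = c
	}
	return batch
}

func main() {
	// On a machine without hwmon sensors this prints an empty batch.
	for path, c := range readTemps("/sys/class/hwmon") {
		fmt.Printf("%s: %.1f C\n", path, c)
	}
}
```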

---

## Where pass/fail lives

- `runs.state` — authoritative terminal state (`Completed`,
  `FailedHolding`, `Released`).
- `runs.result` — `pass` or `fail` string once the run completes.
- `runs.failed_stage` — name of the stage that halted the pipeline, if
  any. Cleared when the operator overrides and re-enters.
- `stages` — one row per attempted stage with `passed`, `started_at`,
  `completed_at`, `summary_json`, `message`.
- `measurements` — time-series samples from the thermal sidecar and
  from stages that capture numeric outputs.
- `artifacts` — on-disk files (report, fio logs, iperf logs, etc).
- `spec_diffs` — one row per expected-vs-actual divergence.

## Adding a new stage

1. Add the name to `store.DefaultStageOrder`.
2. Add a `model.State<Name>` const and wire it into
   `internal/orchestrator/statemachine.go` (both the forward
   transition table and the stage-for-state lookup).
3. Add a case to `agent/runner.go`'s `runStage` dispatch.
4. Drop the implementation into `agent/tests/`.
5. If the stage is orchestrator-owned, add a `resolve<Name>` helper to
   `internal/api/agent_handlers.go` and invoke it from the `/result`
   handler after the preceding stage's `NextState` resolves.