Initial commit: full Phases 1-6 implementation

Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.

2026-04-17 21:32:10 -04:00


Test suite

What each stage measures, what "pass" means, and where the results land. Stages run strictly in order. Any stage returning passed=false halts the pipeline at FailedHolding — the operator decides whether to fix, override, or abandon.

Stage order

Inventory → SpecValidate → SMART → CPUStress → Storage
         → Network → GPU → PSU → Reporting

Stages marked orchestrator-owned resolve inside /result and never show up as "the agent's turn".
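
The halt-on-first-failure rule can be illustrated with a minimal sketch. The Stage type and runPipeline function below are hypothetical names chosen for illustration and are not taken from the project's code; only the state names Completed and FailedHolding come from the text above.

```go
package main

import "fmt"

// Stage is a hypothetical stand-in for one pipeline stage.
type Stage struct {
	Name string
	Run  func() bool // reports whether the stage passed
}

// runPipeline executes stages strictly in order and stops at the first
// failure. It returns the resulting run state and the failed stage, if any.
func runPipeline(stages []Stage) (state, failedStage string) {
	for _, s := range stages {
		if !s.Run() {
			return "FailedHolding", s.Name
		}
	}
	return "Completed", ""
}

func main() {
	pass := func() bool { return true }
	fail := func() bool { return false }
	stages := []Stage{
		{"Inventory", pass},
		{"SpecValidate", pass},
		{"SMART", fail},
		{"CPUStress", pass}, // never reached
	}
	state, failed := runPipeline(stages)
	fmt.Println(state, failed) // FailedHolding SMART
}
```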

Inventory

Owner: agent. What it does: dmidecode, lscpu, lshw, lspci, smartctl -i over each block device, nvidia-smi -q if present. The raw output is merged into a single JSON blob. Pass: the probes run to completion; missing optional tools (e.g. nvidia-smi on a GPU-less host) are tolerated. Artifacts: inventory.json under artifacts/run-<N>/.

SpecValidate (orchestrator-owned)

Owner: orchestrator (resolves inline inside the /result for the preceding Inventory stage). What it does: diffs the submitted inventory against the host's expected_spec_yaml. The diff engine classifies each field as critical, warning, or info. Pass: zero critical diffs. Fail mode: fires a SpecMismatch notification; transitions run to Failed → FailedHolding. Artifacts: spec_diffs table rows (one per divergence).
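
As a rough sketch of the "zero critical diffs" rule: the Diff shape, the severityFor table, and the field names below are assumptions made for illustration. The real diff engine and the layout of expected_spec_yaml are not shown in this document.

```go
package main

import "fmt"

type Severity string

const (
	Critical Severity = "critical"
	Warning  Severity = "warning"
	Info     Severity = "info"
)

// Diff is one expected-vs-actual divergence (hypothetical shape).
type Diff struct {
	Field    string
	Expected string
	Actual   string
	Severity Severity
}

// severityFor is an assumed classification table; unlisted fields are info.
var severityFor = map[string]Severity{
	"cpu.model":    Critical,
	"memory.total": Critical,
	"bios.version": Warning,
}

// diffSpec compares expected and actual values and classifies each divergence.
func diffSpec(expected, actual map[string]string) []Diff {
	var diffs []Diff
	for field, want := range expected {
		got := actual[field]
		if got == want {
			continue
		}
		sev, ok := severityFor[field]
		if !ok {
			sev = Info
		}
		diffs = append(diffs, Diff{field, want, got, sev})
	}
	return diffs
}

// passed reports whether the diff set contains zero critical entries.
func passed(diffs []Diff) bool {
	for _, d := range diffs {
		if d.Severity == Critical {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]string{"cpu.model": "EPYC 7302", "bios.version": "2.1"}
	actual := map[string]string{"cpu.model": "EPYC 7302", "bios.version": "2.0"}
	diffs := diffSpec(expected, actual)
	fmt.Println(len(diffs), passed(diffs)) // 1 true
}
```

In this sketch a BIOS version drift is recorded as a warning row but does not fail the stage; only a critical divergence would.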

SMART

Owner: agent. What it does: smartctl -a /dev/<disk> for each disk in the inventory's expected_disks. Parses reallocated-sector counts, pending sectors, end-to-end error counters, and the overall-health assessment. Pass: SMART overall-health is PASSED on every expected disk and reallocated-sector count is below threshold. Artifacts: smart-<disk>.txt raw output.
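
A minimal sketch of the pass check, assuming ATA-style smartctl -a text output (NVMe and SCSI devices print a different layout and are not handled here). The parseSmart and smartPass names and the threshold parameter are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

type smartSummary struct {
	HealthPassed bool
	Reallocated  int
}

// parseSmart extracts the overall-health result and the raw
// Reallocated_Sector_Ct value from ATA-style smartctl -a output.
func parseSmart(out string) smartSummary {
	var s smartSummary
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "SMART overall-health self-assessment test result:") {
			s.HealthPassed = strings.HasSuffix(strings.TrimSpace(line), "PASSED")
			continue
		}
		// Attribute rows have ten columns; the raw value is the last one.
		f := strings.Fields(line)
		if len(f) >= 10 && f[1] == "Reallocated_Sector_Ct" {
			if n, err := strconv.Atoi(f[9]); err == nil {
				s.Reallocated = n
			}
		}
	}
	return s
}

// smartPass applies the rule: health PASSED and reallocated count below threshold.
func smartPass(s smartSummary, maxReallocated int) bool {
	return s.HealthPassed && s.Reallocated < maxReallocated
}

const sample = `SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       3
`

func main() {
	s := parseSmart(sample)
	fmt.Println(s, smartPass(s, 10)) // {true 3} true
}
```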

CPUStress

Owner: agent. What it does: runs stress-ng --cpu N --vm M --vm-bytes 90% -t 120s with N = logical_cores and M ≈ logical_cores/2. The --vm flag is the stand-in for Memtest86+: it exercises the memory subsystem under load and can fail if the RAM has latent faults that surface under thermal and allocator pressure. Pass: stress-ng exits 0 and thermal samples taken by the sidecar stay below the configured per-host max_temp_c. Caveat: this is weaker than a dedicated memtest pass; see architecture.md for the reasoning (Memtest86+ cannot report its result back without IPMI serial).
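
The command line and the two-part pass rule can be sketched as follows. This is an illustration only: stressArgs and cpuStressPass are hypothetical helpers, and rounding M down with a minimum of one worker is an assumption about what "M ≈ logical_cores/2" means.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
)

// stressArgs builds the stress-ng arguments described above:
// N = logical cores, M = about half of that (at least 1).
func stressArgs(logicalCores int) []string {
	vm := logicalCores / 2
	if vm < 1 {
		vm = 1
	}
	return []string{
		"--cpu", strconv.Itoa(logicalCores),
		"--vm", strconv.Itoa(vm),
		"--vm-bytes", "90%",
		"-t", "120s",
	}
}

// cpuStressPass applies both conditions: stress-ng exited 0 and every
// thermal sample stayed below the host's max_temp_c.
func cpuStressPass(exitCode int, samplesC []float64, maxTempC float64) bool {
	if exitCode != 0 {
		return false
	}
	for _, t := range samplesC {
		if t >= maxTempC {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("stress-ng", stressArgs(runtime.NumCPU()))
	fmt.Println(cpuStressPass(0, []float64{61.5, 72.0, 78.25}, 85)) // true
}
```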

Storage

Owner: agent (destructive). What it does:

1. Wipe probe: scans for filesystem signatures, LVM metadata, and partition tables on the expected disks. Any hit halts the stage with UnexpectedData; the operator must click Override wipe-probe.
2. badblocks -svw (destructive read/write) on each expected disk.
3. fio --rw=randrw --bs=4k --iodepth=32 --runtime=60 --size=1G on each disk; captures IOPS and p99 latency.

Pass: badblocks reports zero bad blocks; fio IOPS above a per-class floor (configurable). Artifacts: fio-<disk>.json per disk. Safety gate: the wipe-probe + device allowlist are the second and third lines of defense against wiping the wrong disk. See architecture.md § Safety.
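
To make the wipe-probe idea concrete, here is a sketch that checks the first few sectors of a disk image for well-known on-disk signatures. It assumes 512-byte logical sectors and covers only MBR, GPT, LVM2, and ext2/3/4; the agent's actual probe is not shown in this document and may work differently (for example by shelling out to an existing tool). probeSignatures is a hypothetical name.

```go
package main

import (
	"bytes"
	"fmt"
)

// probeSignatures inspects the leading bytes of a disk and returns the
// names of any signatures found. An empty result means "looks blank".
func probeSignatures(head []byte) []string {
	var hits []string
	// MBR / protective MBR boot signature at the end of sector 0.
	if len(head) >= 512 && head[510] == 0x55 && head[511] == 0xAA {
		hits = append(hits, "mbr")
	}
	// GPT header magic at the start of LBA 1.
	if len(head) >= 520 && bytes.Equal(head[512:520], []byte("EFI PART")) {
		hits = append(hits, "gpt")
	}
	// LVM2 physical-volume label in one of the first four sectors.
	for off := 0; off+8 <= len(head) && off < 4*512; off += 512 {
		if bytes.Equal(head[off:off+8], []byte("LABELONE")) {
			hits = append(hits, "lvm2")
			break
		}
	}
	// ext2/3/4 superblock magic 0xEF53 (little-endian) at byte 1080.
	if len(head) >= 1082 && head[1080] == 0x53 && head[1081] == 0xEF {
		hits = append(hits, "ext")
	}
	return hits
}

func main() {
	blank := make([]byte, 4096)
	fmt.Println(len(probeSignatures(blank))) // 0

	used := make([]byte, 4096)
	used[510], used[511] = 0x55, 0xAA
	copy(used[512:], "EFI PART")
	fmt.Println(probeSignatures(used)) // [mbr gpt]
}
```

Any non-empty result would correspond to the UnexpectedData halt described above.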

Network

Owner: agent. What it does: iperf3 -c <orchestrator> -p <iperf_port> -t 10 -J to measure throughput to the orchestrator. The orchestrator-side iperf3 -s is supervised by internal/orchestrator/iperf.go and binds to the configured network.iperf_port. Pass: throughput ≥ per-class floor (1 Gbps for 1GbE NICs, 9 Gbps for 10GbE). Artifacts: iperf-<nic>.json.
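
A sketch of how the -J output might be evaluated against the floor. It reads end.sum_received.bits_per_second from the iperf3 JSON for a TCP test; throughputGbps and networkPass are hypothetical helpers, and the sample payload is trimmed to the one field used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// iperfResult models only the part of iperf3's JSON output used here.
type iperfResult struct {
	End struct {
		SumReceived struct {
			BitsPerSecond float64 `json:"bits_per_second"`
		} `json:"sum_received"`
	} `json:"end"`
}

// throughputGbps returns the receiver-side throughput in Gbps.
func throughputGbps(raw []byte) (float64, error) {
	var r iperfResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	return r.End.SumReceived.BitsPerSecond / 1e9, nil
}

// networkPass applies the per-class floor.
func networkPass(gbps, floorGbps float64) bool {
	return gbps >= floorGbps
}

func main() {
	raw := []byte(`{"end":{"sum_received":{"bits_per_second":9.41e9}}}`)
	gbps, err := throughputGbps(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%.2f Gbps, pass=%v\n", gbps, networkPass(gbps, 9.0))
}
```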

GPU

Owner: agent. What it does: runs nvidia-smi -q and a short compute workload (gpu-burn if present, else nvidia-smi dmon during a stress-ng --gpu burst). Skipped cleanly when no GPU is present. Pass: no ECC errors reported; temperature below threshold; compute workload exits 0.

PSU

Owner: agent. What it does: reads /sys/class/hwmon/*/power*_average and in*_input during a synthetic load burst (CPU + disk + NIC simultaneously) to look for voltage sag or wattage anomalies. Records the full envelope as measurements rows with kind=psu. Pass: no voltage dip below threshold across the load burst. Caveat: only reports on what the BMC exposes via hwmon, so servers without exposed PSU telemetry pass trivially. This is a documented limitation.
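
The sag check can be sketched as a minimum-over-samples comparison. hwmon in*_input values are reported in millivolts; the per-rail floor map, the rail names, and the function names below are assumptions for illustration. A rail with no samples is skipped, which mirrors the "pass trivially" caveat.

```go
package main

import "fmt"

// lowestMilliVolts returns the lowest sample; ok is false when the
// rail produced no samples at all.
func lowestMilliVolts(samples []int) (lowest int, ok bool) {
	if len(samples) == 0 {
		return 0, false
	}
	lowest = samples[0]
	for _, v := range samples[1:] {
		if v < lowest {
			lowest = v
		}
	}
	return lowest, true
}

// psuPass reports whether every rail with telemetry stayed at or above
// its floor for the whole load burst. Rails without samples are skipped.
func psuPass(samples map[string][]int, floors map[string]int) bool {
	for rail, floor := range floors {
		lowest, ok := lowestMilliVolts(samples[rail])
		if !ok {
			continue // no telemetry exposed for this rail
		}
		if lowest < floor {
			return false
		}
	}
	return true
}

func main() {
	samples := map[string][]int{"in1": {12050, 11980, 11400}}
	floors := map[string]int{"in1": 11600}
	fmt.Println(psuPass(samples, floors)) // false
}
```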

Reporting (orchestrator-owned)

Owner: orchestrator (resolves inline inside the /result for PSU). What it does:

1. Gathers run, host, stages, spec_diffs, and measurement aggregates.
2. Renders report.html via internal/report (html/template with inlined CSS; self-contained and viewable offline).
3. Writes report.json with the same data in machine-readable form.
4. Records both as report_html / report_json artifact rows.
5. Transitions the run to Completed.
6. Fires the RunCompleted notification.
7. The next agent heartbeat returns cmd=shutdown.
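
As an illustration of the rendering step, the sketch below produces a self-contained HTML page with html/template and an inlined style block. The reportData shape and the template text are invented for this example and do not reflect the actual internal/report package.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

type stageRow struct {
	Name   string
	Passed bool
}

// reportData is a hypothetical, much-reduced view of a run.
type reportData struct {
	Host   string
	Result string
	Stages []stageRow
}

// The CSS is inlined so the file can be opened offline with no other assets.
const reportTmpl = `<!doctype html>
<html><head><meta charset="utf-8"><title>Run report: {{.Host}}</title>
<style>.pass{color:green}.fail{color:red}</style></head>
<body><h1>{{.Host}}: {{.Result}}</h1>
<ul>{{range .Stages}}<li class="{{if .Passed}}pass{{else}}fail{{end}}">{{.Name}}</li>{{end}}</ul>
</body></html>`

// renderReport executes the template into a string.
func renderReport(d reportData) (string, error) {
	t, err := template.New("report").Parse(reportTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderReport(reportData{
		Host:   "pve-node-01",
		Result: "fail",
		Stages: []stageRow{{"Inventory", true}, {"SMART", false}},
	})
	if err != nil {
		fmt.Println("render error:", err)
		return
	}
	fmt.Println(out)
}
```

Because html/template escapes values by context, a host name containing markup characters would be rendered as text rather than interpreted as HTML.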

Thermal sidecar

Owner: agent (always-on from Booting until the agent exits). What it does: every 5 seconds, walks /sys/class/hwmon/* and POSTs temperature samples as a batch to /sensor. Populates the measurements table with kind=thermal. No pass/fail on its own — stages that care about thermals read the sidecar's data via measurements. A dead sensor just drops out of the next batch.
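
One sampling pass of the sidecar might look like the sketch below. It reads temp*_input files, which hwmon reports in millidegrees Celsius, and silently skips sensors that fail to read. The 5-second loop and the POST to /sensor are left out; the sample type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

type sample struct {
	Sensor string // path relative to the hwmon root, e.g. hwmon0/temp1_input
	TempC  float64
}

// parseMilliC converts a hwmon temp*_input value (millidegrees C) to degrees C.
func parseMilliC(raw string) (float64, error) {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return 0, err
	}
	return float64(n) / 1000, nil
}

// readThermals collects one batch of temperature samples under root.
// A sensor that cannot be read or parsed is dropped from the batch.
func readThermals(root string) []sample {
	paths, _ := filepath.Glob(filepath.Join(root, "hwmon*", "temp*_input"))
	var batch []sample
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // dead sensor
		}
		c, err := parseMilliC(string(b))
		if err != nil {
			continue
		}
		rel, _ := filepath.Rel(root, p)
		batch = append(batch, sample{Sensor: rel, TempC: c})
	}
	return batch
}

func main() {
	for _, s := range readThermals("/sys/class/hwmon") {
		fmt.Printf("%s %.1f\n", s.Sensor, s.TempC)
	}
}
```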

Where pass/fail lives

- runs.state: authoritative terminal state (Completed, FailedHolding, Released).
- runs.result: pass or fail string once the run completes.
- runs.failed_stage: name of the stage that halted the pipeline, if any. Cleared when the operator overrides and re-enters.
- stages: one row per attempted stage with passed, started_at, completed_at, summary_json, message.
- measurements: time-series samples from the thermal sidecar and from stages that capture numeric outputs.
- artifacts: on-disk files (report, fio logs, iperf logs, etc).
- spec_diffs: one row per expected-vs-actual divergence.

Adding a new stage

1. Add the name to store.DefaultStageOrder.
2. Add a model.State<Name> const and wire it into internal/orchestrator/statemachine.go (both the forward transition table and the stage-for-state lookup).
3. Add a case to agent/runner.go's runStage dispatch.
4. Drop the implementation into agent/tests/.
5. If the stage is orchestrator-owned, add a resolve<Name> helper to internal/api/agent_handlers.go and invoke it from the /result handler after the preceding stage's NextState resolves.
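
The steps above touch two pieces: the ordered stage list and the agent-side dispatch. The sketch below shows how those fit together for an invented "Fan" stage. Everything here is a simplified stand-in: stageOrder only imitates the idea of store.DefaultStageOrder, and nextStage, runStage, and runFan do not reproduce the real signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// stageOrder imitates an ordered stage list with a hypothetical
// "Fan" stage inserted between GPU and PSU.
var stageOrder = []string{
	"Inventory", "SpecValidate", "SMART", "CPUStress", "Storage",
	"Network", "GPU", "Fan", "PSU", "Reporting",
}

// nextStage returns the stage that follows cur, if there is one.
func nextStage(cur string) (string, bool) {
	for i, s := range stageOrder {
		if s == cur && i+1 < len(stageOrder) {
			return stageOrder[i+1], true
		}
	}
	return "", false
}

// runFan is a placeholder implementation for the new stage.
func runFan() (bool, error) {
	return true, nil
}

// runStage dispatches a stage name to its agent-side implementation.
func runStage(name string) (bool, error) {
	switch name {
	case "Fan":
		return runFan()
	default:
		return false, errors.New("no agent implementation for stage " + name)
	}
}

func main() {
	next, _ := nextStage("GPU")
	passed, err := runStage(next)
	fmt.Println(next, passed, err) // Fan true <nil>
}
```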
