Commit Graph

11 Commits

Author SHA1 Message Date
josh 5c00edd7b6 ui: fix htmx-ext-sse integrity hash (was silently blocked by browser)
CI / Lint + build + test (push) Successful in 1m20s
Release / release (push) Successful in 5m48s
Detail-page pipeline + log panes weren't updating without a manual
refresh. Root cause: the integrity attribute on htmx-ext-sse@2.2.2
in layout.templ was wrong, so the browser refused to execute the
script (SRI enforcement is silent — no user-visible error unless
you open devtools). htmx core loaded, boosted nav worked, forms
worked — but sse-connect/sse-swap were inert because the extension
never registered, so no EventSource was ever opened.

Replaced the claimed hash (Y4gc0CK6...) with the real one
(fw+eTlCc...) computed via
  curl -sL https://unpkg.com/htmx-ext-sse@2.2.2 |
  openssl dgst -sha384 -binary | openssl base64 -A

Added sse_e2e_test.go as a regression canary that mounts the real
chi router (RealIP + Recoverer + Logger middleware), opens
GET /events, publishes a tile-update via Runner, and asserts the
event lands on the wire. Server-side unit tests only verified
rendered HTML — this one covers the full publish→wire path, which
is what the next regression in this area will hit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 17:51:58 -04:00
josh cdd6cae3b0 ui: keep detail-page SSE swaps live after the first outerHTML replace
CI / Lint + build + test (push) Successful in 1m28s
Release / release (push) Successful in 6m29s
Pipeline fragment payload was a bare <div class=pipeline>, but the
sse-swap=pipeline-N wrapper lived only in the page shell. The first
outerHTML swap destroyed the wrapper, so every subsequent pipeline
event had nothing to target — forcing a manual refresh. RenderPipelineString
now emits the full <section id=pipeline-N sse-swap=... hx-swap=outerHTML>
wrapper, used from both the shell and the orchestrator publish path.

Also drop the red-bar styling from the empty DetailHold placeholder:
the wrapper's detail-hold class was painting an unconditional red band
between Pipeline and Actions whenever no hold was active.
2026-04-18 17:03:39 -04:00
josh 0db790ae3e ui: stream host-detail fragments over SSE so the page updates live
CI / Lint + build + test (push) Successful in 1m29s
Release / release (push) Has been cancelled
The detail page was only partly live: Pipeline + LogTabs subscribed to
SSE, but the summary header, actions row, spec-diffs list and hold-key
block all froze at page-load and required a manual refresh to catch up
with state changes.

Extract each of those four regions into its own named templ component
with a stable id and sse-swap target, add Render*String helpers so the
orchestrator can publish pre-rendered fragments, and register a
HostDetailRenderer alongside the existing Tile/Pipeline renderers.
PublishHostDetail is folded into publishTileUpdate so every call site
that already refreshes a tile now also refreshes the detail page —
keeps the fan-out honest without scattering new publish calls.

The empty-state wrappers for spec-diffs and hold are load-bearing:
without the <section id=... sse-swap=...> present at initial GET, the
first live event after SpecValidate or Hold writes would have no DOM
node to swap into.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 16:36:13 -04:00
josh 4524ab8dc0 runs: add non-destructive flag + operator Cancel button
CI / Lint + build + test (push) Successful in 2m5s
Release / release (push) Successful in 3m5s
Non-destructive pre-declares "don't touch the disks" on Start: the
Storage stage skips wipe-probe, badblocks -w, and write-mode fio,
and reports a read-only summary. Runs a new non_destructive column;
threaded through Claim → agent tests.Deps → Storage stage.

Cancel halts an in-flight run. The orchestrator transitions to a
new StateCancelled via TriggerOperatorCancelled (valid from any
active state); the agent's next heartbeat returns cmd=cancel_stage,
which fires a stored CancelFunc on the per-stage context. Stage
subprocesses spawned with exec.CommandContext die with the context,
the agent posts a cancelled outcome, then powers the host off.

Destructive stages mid-run may leave the host in an intermediate
state — the UI confirm dialog warns the operator; recovery is
manual for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 13:01:42 -04:00
josh d0bfae14c8 Heartbeat-first dispatch: retire WoL-as-default, add WaitingReboot
CI / Lint + build + test (push) Has been cancelled
Every supported host runs vetting-reporter in-OS and heartbeats every
30s. WoL was never the thing that started vetting — the heartbeat
response's reboot_for_vetting command was. Firing WoL first only
crowded the run log with misleading diagnostics when the real failure
mode is "reporter isn't installed."

- StartRun 409s if the host hasn't heartbeated within 60s, pointing
  the operator at /register/quick.sh.
- Dispatcher re-checks LastSeenAt at dispatch time (run may sit in
  Queued long enough for the host to go offline); stale hosts mark
  the run Failed with failed_stage=dispatch instead of looping.
- New StateWaitingReboot + TriggerRebootCommanded capture the actual
  semantics. StateWaitingWoL kept as the hook point for a future
  manual-override button.
- Tile disables the Start button with a quick.sh tooltip when the
  host is offline, matching the server-side 409.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 01:10:34 -04:00
josh 1694c20b12 Host detail v2: full pipeline + per-stage logs + WoL diagnostics
CI / Lint + build + test (push) Has been cancelled
Pipeline now always renders all 13 nodes (3 pre-stage + 9 stage +
Completed), synthesising ghosts from run state when stage rows
aren't seeded yet. Makes a WaitingWoL host show the full timeline
ahead of it instead of just 4 dots.

Agent tags each log line with its stage; logs.Hub fans out to both
log-{runID} and log-{runID}-{stage} SSE events so the detail page
can show per-stage tabs with a pure-CSS radio-sibling switch. Flat
run log prepends [stage] so grep still works.

Dispatcher writes picked/sent-WoL/heartbeat lines into the per-run
log — the operator opens the detail page, sees WaitingWoL stuck,
and reads exactly what the dispatcher did and why nothing's
progressing, instead of having to tail journalctl on the LXC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-18 00:38:27 -04:00
josh bb658a8435 Host detail page + pipeline timeline
CI / Lint + build + test (push) Has been cancelled
Click a tile to open /hosts/{id} — the canonical control surface per
host. Timeline renders every pre-stage, stage, and terminal node in
order, with the current one pulsing, failed ones flagged, and
downstream ones dimmed as skipped. Detail page shows summary, hold
card (when holding), all action buttons, spec diffs, a full-height
log pane, and a collapsed expected-spec YAML.

Tile slims to name, last-seen, status, and one primary action; a
CSS-overlay <a> makes the whole card clickable while buttons stay
receptive via z-index.

Runner.publishTileUpdate now also emits pipeline-{runID} fragments,
and CompleteStage wraps Stages.CompleteByName so stage completions
advance the timeline live — without this the dots only moved on
state transitions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:59:43 -04:00
josh a0c0fb114f Add host-mode heartbeat: vetting-agent host + last-seen badge
CI / Lint + build + test (push) Has been cancelled
vetting-agent gains a `host` subcommand that runs as a systemd service
installed by the quick-register one-liner, POSTing every 30s to
/api/v1/hosts/{mac}/heartbeat so the dashboard tile shows "online" or
"Nm ago" without waiting on WoL. Ships dormant client code for the
Phase 2 reboot_for_vetting command so the server can flip it on later
without a binary redeploy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-17 23:34:15 -04:00
josh 8b3d9a312e Add quick-register one-liner for target-host registration
CI / Lint + build + test (push) Failing after 5m15s
Operator pastes `curl -fsSL $ORCH/register/quick.sh | sudo bash` on the
target host (pre-wipe). The script probes MAC + CPU/RAM/disks/NICs/GPUs,
emits an expected-spec YAML, and POSTs to a new LAN-trusted JSON
endpoint /api/v1/hosts. The register page shows the command prefilled
with the orchestrator URL; the manual form moves into a collapsible
"Register manually" disclosure.
2026-04-17 22:50:54 -04:00
josh 42da48864f Remove operator auth — trust the LAN
CI / Lint + build + test (push) Failing after 5m15s
Can't log in from a fresh LXC deploy, and the service is LAN-only by
design. Rip out the whole bcrypt-password / signed-cookie session
layer: internal/auth, login templates, gen-admin-password binary +
Makefile targets, auth config block, login/logout routes and the
RequireSession middleware wrap. Agent bearer-token auth on
/api/v1/runs/{id}/* is untouched.

Operators who want a password can front the service with a reverse
proxy — noted in README and docs/operations.md.
2026-04-17 22:31:49 -04:00
josh 9bb4b09a04 Initial commit: full Phases 1-6 implementation
CI / Lint + build + test (push) Has been cancelled
Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00