Commit Graph

5 Commits

Author SHA1 Message Date
josh 3656af9823 feat(end-of-run): reboot to local disk instead of powering off
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 10m8s
Completed runs now reboot the host and fall through iPXE to the next
boot device (local disk) instead of powering off. Three coordinated
changes:

- pxe/ipxe: NoActiveRunScript exits iPXE (drops to next boot entry)
  instead of `sleep 10; poweroff`. Without this, a host rebooting after
  a Completed run would loop back through PXE and be told to power off.
- api/agent_handlers: heartbeat returns cmd=reboot (was cmd=shutdown)
  when the run reaches Completed.
- agent/runner: runs `systemctl reboot` (with `shutdown -r now`
  fallback) in response to cmd=reboot.
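
The agent-side behaviour in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in agent/runner: the names `runFunc`, `execRun`, `rebootHost`, and `handleCommand` are hypothetical, and the command runner is injected only so the fallback order can be checked without rebooting a machine.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runFunc abstracts command execution so the fallback order can be
// exercised in tests without actually rebooting anything.
// (Hypothetical type, not taken from the repository.)
type runFunc func(name string, args ...string) error

// execRun runs a real command and waits for it to exit.
func execRun(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// rebootHost tries `systemctl reboot` first and falls back to
// `shutdown -r now` only if the first command returns an error.
func rebootHost(run runFunc) error {
	errSystemctl := run("systemctl", "reboot")
	if errSystemctl == nil {
		return nil
	}
	if errShutdown := run("shutdown", "-r", "now"); errShutdown != nil {
		return fmt.Errorf("reboot failed: systemctl: %v; shutdown: %w",
			errSystemctl, errShutdown)
	}
	return nil
}

// handleCommand dispatches a command string received from a heartbeat
// response. Only "reboot" is modelled here; the real agent presumably
// handles more commands.
func handleCommand(cmd string, run runFunc) error {
	switch cmd {
	case "reboot":
		return rebootHost(run)
	default:
		return fmt.Errorf("unhandled command %q", cmd)
	}
}
```

In production the caller would pass `execRun`, e.g. `handleCommand("reboot", execRun)`; a test can pass a stub that records the calls instead.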

Operator cancel still powers off — powerOffAndReturn is unchanged
because a cancel means the operator wants the host idle so they can
walk up to it, not back in rotation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-19 22:45:11 -04:00
josh 026923075c pxe: disable systemd-firstboot so the live image doesn't prompt
CI / Lint + build + test (push) Successful in 1m22s
Release / release (push) Has been cancelled
systemd-firstboot.service is an interactive wizard that asks for
locale, timezone, and root password when /etc/machine-id isn't
populated — i.e. every PXE boot of a mkosi-built image. It sits on
sysinit.target waiting for input that will never arrive, blocking
the agent service and every other downstream unit indefinitely.

systemd.firstboot=off on the kernel cmdline is the documented kill
switch; no image-side changes needed.
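
A quick way to guard against this regressing is to check the rendered kernel cmdline for the switch. The sketch below is an assumption-laden illustration: `parseBool` and `firstbootDisabled` are hypothetical helpers, the accepted boolean spellings follow what systemd is understood to accept, and the sketch assumes that the last occurrence of the option wins.

```go
package main

import "strings"

// parseBool recognises the boolean spellings systemd is understood to
// accept (assumption; compared case-insensitively).
func parseBool(s string) (value, ok bool) {
	switch strings.ToLower(s) {
	case "1", "yes", "y", "true", "t", "on":
		return true, true
	case "0", "no", "n", "false", "f", "off":
		return false, true
	}
	return false, false
}

// firstbootDisabled reports whether cmdline carries a
// systemd.firstboot= option with a false value. If the option appears
// more than once, the last parseable occurrence decides (assumption).
func firstbootDisabled(cmdline string) bool {
	const key = "systemd.firstboot="
	disabled := false
	for _, field := range strings.Fields(cmdline) {
		if !strings.HasPrefix(field, key) {
			continue
		}
		if enabled, ok := parseBool(strings.TrimPrefix(field, key)); ok {
			disabled = !enabled
		}
	}
	return disabled
}
```

A test in the PXE package could call `firstbootDisabled` on whatever string the boot script passes to the kernel and fail the build if it returns false.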
2026-04-18 15:35:24 -04:00
josh c45349f62c pxe: mask serial-getty@ttyS0 so hosts without serial don't wait 90s
CI / Lint + build + test (push) Successful in 1m47s
Release / release (push) Successful in 5m16s
systemd-getty-generator reads console=ttyS0 off the kernel cmdline and
auto-creates serial-getty@ttyS0.service, which BindsTo dev-ttyS0.device.
On hardware without a physical serial port the device node never shows
up, systemd waits its full default 90s timeout, and only then proceeds.

systemd.mask= on the kernel cmdline is a first-class option — masks
the unit before the generator's link even gets activated. Kernel
messages still go to ttyS0 if a port is present; we just don't try
to spawn a login prompt there.
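
As an illustration of how the mask argument relates to the console setting, the sketch below derives one `systemd.mask=` argument per serial console named on a cmdline. The helper `serialGettyMasks` is hypothetical and not part of the repository; the commit itself may simply hard-code the ttyS0 unit name.

```go
package main

import "strings"

// serialGettyMasks returns one systemd.mask= argument for every
// distinct serial console (ttyS*) named by a console= option in
// cmdline. Non-serial consoles such as tty0 are ignored.
func serialGettyMasks(cmdline string) []string {
	var masks []string
	seen := map[string]bool{}
	for _, field := range strings.Fields(cmdline) {
		if !strings.HasPrefix(field, "console=") {
			continue
		}
		dev := strings.TrimPrefix(field, "console=")
		// Strip line settings: console=ttyS0,115200n8 -> ttyS0.
		if i := strings.IndexByte(dev, ','); i >= 0 {
			dev = dev[:i]
		}
		if !strings.HasPrefix(dev, "ttyS") || seen[dev] {
			continue
		}
		seen[dev] = true
		masks = append(masks, "systemd.mask=serial-getty@"+dev+".service")
	}
	return masks
}
```

For a cmdline containing `console=tty0 console=ttyS0,115200n8` this yields the single argument `systemd.mask=serial-getty@ttyS0.service`, which can be appended to the same cmdline.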
2026-04-18 14:47:03 -04:00
josh a88e24bef4 live-image: real /init + verbose boot for first-boot diagnosis
CI / Lint + build + test (push) Successful in 1m23s
Release / release (push) Successful in 4m49s
Host boots past kernel init and then stalls silently. ACPI DSDT error
about TXHC.RHUB.SS01 is benign noise (Tiger Lake firmware bug) — the
actual problem is that nothing between kernel handoff and (maybe)
systemd is visible on the console.

Two changes:

1. Replace the /init → sbin/init symlink with a real shell script
   (live-image/mkosi.extra/init) that mounts /proc /sys /dev /dev/pts
   /dev/shm /run before execing systemd. Systemd has fallback mount
   code for these, but when it fails the failure is silent. Doing it
   explicitly in /init keeps failures visible and avoids the fragile
   symlink-resolution trick.

2. Drop 'quiet' from the kernel cmdline and add loglevel=7 plus
   systemd.log_target=kmsg + journald.forward_to_console=1 so every
   early-boot message reaches both tty0 and ttyS0. Will be dialed
   back once boot is stable.
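
The cmdline change in item 2 amounts to a small rewrite, sketched below. `verboseArgs` and `verboseCmdline` are hypothetical names, and the option spellings are copied verbatim from this commit message rather than verified; they are worth checking against the kernel and journald.conf documentation before reuse.

```go
package main

import "strings"

// verboseArgs lists the diagnostic options named in the commit
// message, spelled exactly as they appear there.
var verboseArgs = []string{
	"loglevel=7",
	"systemd.log_target=kmsg",
	"journald.forward_to_console=1",
}

// verboseCmdline drops the bare "quiet" flag and appends any verbose
// option that is not already present. All other arguments keep their
// original order, so applying the function twice changes nothing.
func verboseCmdline(cmdline string) string {
	var out []string
	present := map[string]bool{}
	for _, field := range strings.Fields(cmdline) {
		if field == "quiet" {
			continue
		}
		out = append(out, field)
		present[field] = true
	}
	for _, arg := range verboseArgs {
		if !present[arg] {
			out = append(out, arg)
		}
	}
	return strings.Join(out, " ")
}
```

Keeping the rewrite in one function makes it easy to dial the verbosity back later by emptying `verboseArgs` or skipping the call.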

Also: .gitattributes pins LF on live-image/, .gitea/, Makefile, and
*.sh so Windows checkouts don't break shell scripts and Makefile
recipes with CRLF. /init also gets chmod 0755 in repack-initrd as a
belt-and-braces guard against mode loss on non-Linux checkouts.
2026-04-18 14:31:40 -04:00
josh 9bb4b09a04 Initial commit: full Phases 1-6 implementation
CI / Lint + build + test (push) Has been cancelled
Post-repair hardware validation pipeline for Proxmox cluster hosts.
Go orchestrator + in-image agent + mkosi live image + bundled dnsmasq
PXE + SQLite + HTMX/SSE UI + notify registry + janitor + full docs.
2026-04-17 21:32:10 -04:00