An extended power loss for my primary Proxmox 8 cluster, while I was remote, took half of my cluster nodes out of commission into an unbootable state. This unbootable half of the cluster would not show up on the network after the power came back even with manual physical rebooting. The other half would boot up and show on the network. All the nodes had a second problem that they would not open a PVE WebUI Console Shell or show any output on any of the video output ports for either the Nvidia PCIe GPU or the Intel iGPU. So I have to figure out what looks to be a set of overlapping issues and clean up this mess. There were several lessons learned and re-learned along the way.
First, I need a “crash cart” to recover these to a bootable state. What is a “crash cart”, well that is usually a rolling cart found in a data center that you roll up to a broken server. They typically include some sort of serial terminal and/or a monitor, keyboard and mouse with a lot of connectors and adapters to hook up to random port for the equipment you are fixing. Mine includes adapters for VGA, DVI, DisplayPort, HDMI and both USB and PS/2 keyboard and mice. I’ve even thrown in a spare known good Nvidia K600 video card for troubleshooting graphic cards. A trusty and up to date Ventoy Bootable USB is sitting on there as well. I have a laptop that I could use for a serial terminal if we get to that point but I was hoping I didn’t need it since those are mostly for network equipment.
Here is my quickly thrown together trash can crash cart (TC3) for this adventure.
My earlier post for ProxMox 8.2.2 Cluster on Dell Wyse 3040s mentioned the tight constraints of the cluster both with RAM and DISK space. There are some extra steps involved in keeping a very lean Proxmox 8 cluster running on these extremely resource limited boxes. I am running Proxmox 8.2 and Ceph Reef on them which leaves them slightly under resourced as a default. So when the Ceph would not start up the Ceph Monitors after my upgrade from Proxmox 8.2.2 to 8.2.4, I had to dig a bit to find the problem.
Ceph Monitor will not start up if there is not at least 5% free disk space on the root partition. My root volumes were sitting right at 95% used. So our story begins…