McGarrah Technical Blog

Posts in category "troubleshooting"

Debian 12 SystemD nightly reboots on Dell Wyse 3040s

My super lean Proxmox 8.3 testbed cluster running Ceph occasionally decides to lock up a node because it is incredibly limited on RAM and CPU. As much as I hate rebooting Linux/UNIX systems, this is a case where a nightly reboot of the nodes might help with reliability.
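As a rough sketch of what a nightly reboot could look like, here is a systemd timer/service pair for Debian 12. The unit names and the 03:30 schedule are illustrative assumptions, not the exact setup from the post.

```bash
# Illustrative systemd timer for a nightly reboot on Debian 12.
# Unit names and the 03:30 schedule are assumptions for this sketch.

cat <<'EOF' > /etc/systemd/system/nightly-reboot.service
[Unit]
Description=Scheduled nightly reboot

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl reboot
EOF

cat <<'EOF' > /etc/systemd/system/nightly-reboot.timer
[Unit]
Description=Trigger nightly-reboot.service every night

[Timer]
OnCalendar=*-*-* 03:30:00
RandomizedDelaySec=600

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now nightly-reboot.timer
```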

Ceph Cluster Complete Removal on Proxmox for the Homelabs

My test Proxmox cluster exists for exactly this kind of experimentation, and along the way I badly broke the Ceph cluster part of it while doing a lot of physical media replacements. The test cluster is the right place to try out risky stuff instead of my main cluster that is loaded up with my data. Fixing a broken cluster often teaches you something, but in this case I already know the lessons and just want to fast-track getting a clean Ceph cluster back online.

I need it back in place to test the Proxmox 8.2 to 8.3 upgrade path for my main cluster. So this is a quick guide on how to completely clean out your Ceph installation, as if it never existed, on a Proxmox 8.2 or 8.3 cluster.
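For a sense of the kind of commands involved, here is a rough, destructive outline only; it assumes the Ceph data is disposable and is run on every node. The full post covers the complete procedure and ordering.

```bash
# Rough outline only: destructive, run on every node, assumes all Ceph
# data is disposable. Not the complete step-by-step from the post.

systemctl stop ceph.target           # stop all Ceph daemons on this node
pveceph purge                        # Proxmox helper that removes the Ceph configuration
rm -rf /etc/ceph /var/lib/ceph       # leftover local config and daemon state
rm -f /etc/pve/ceph.conf             # cluster-wide config stored in pmxcfs
```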

[Image: Proxmox Ceph install dialog]

Proxmox VE 8.1 to 8.2 upgrade issues in the Homelabs

An extended power loss for my primary Proxmox 8 cluster, while I was remote, left half of my cluster nodes in an unbootable state. This unbootable half of the cluster would not show up on the network after the power came back, even with manual physical rebooting. The other half would boot up and show on the network. All of the nodes had a second problem: they would not open a PVE WebUI console shell or show any output on the video ports of either the Nvidia PCIe GPU or the Intel iGPU. So I had to figure out what looked to be a set of overlapping issues and clean up this mess. There were several lessons learned and re-learned along the way.

First, I needed a “crash cart” to recover these nodes to a bootable state. What is a “crash cart”? It is usually a rolling cart found in a data center that you roll up to a broken server. They typically include some sort of serial terminal and/or a monitor, keyboard, and mouse, along with a lot of connectors and adapters to hook up to whatever random ports the equipment you are fixing happens to have. Mine includes adapters for VGA, DVI, DisplayPort, and HDMI, plus both USB and PS/2 keyboards and mice. I’ve even thrown in a spare known-good Nvidia K600 video card for troubleshooting graphics cards. A trusty and up-to-date Ventoy bootable USB is sitting on there as well. I have a laptop I could use as a serial terminal if it came to that, but I was hoping I wouldn’t need it since serial consoles are mostly for network equipment.

Crash Cart

Here is my quickly thrown-together trash can crash cart (TC3) for this adventure.

Ceph Cluster rebalance issue

This is a rough draft that I’m just pushing out since it might be useful to someone, rather than letting it sit in my drafts folder forever… Good enough beats perfect that never ships, every time.

I think I have mentioned my Proxmox/Ceph combo cluster in an earlier post. A quick summary: it is a five (5) node cluster for Proxmox HA, and three of those nodes run Ceph with three (3) OSDs each, for a total of nine (9) 5TB OSDs. They are in a 3/2 Ceph replication configuration, with three copies of each piece of data, which keeps the cluster running as long as two of the nodes are active. Those OSDs / hard drives were added in batches of three (3), one per node, as I could get drives cleaned and available. So I added them piecemeal in a set of three OSDs, then three more, and finally the last batch of three. I’m also committing the sin of using 1Gbps instead of 10Gbps SAN networking for the Ceph cluster, so performance is impacted.
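In Ceph terms, the 3/2 setup corresponds to a pool with size=3 and min_size=2. A minimal sketch of checking and setting that follows; the pool name "cephfs_data" is a placeholder, not necessarily what this cluster uses.

```bash
# Inspect / set the 3/2 replication described above.
# "cephfs_data" is an assumed pool name for this example.
ceph osd pool get cephfs_data size        # expect 3 (copies of each object)
ceph osd pool get cephfs_data min_size    # expect 2 (copies needed to stay writable)
ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_data min_size 2
```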

Adding them in pieces while I was also loading up CephFS with media content is what is hurting me now. My first three OSDs, spread across the three nodes, are pretty full at 75-85%, and as I added the next batches, the cluster never fully caught up and rebalanced the initial contents. This skews my ‘ceph osd df tree’ output, showing less usable space than I actually have available.
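There are a couple of standard ways to nudge Ceph into spreading data more evenly; a hedged sketch is below. This is not necessarily what I ended up doing, just the usual options: the upmap balancer module, or the older reweight-by-utilization approach.

```bash
ceph osd df tree                  # per-OSD utilization, as referenced above
ceph balancer status
ceph balancer mode upmap          # upmap mode needs all clients at Luminous or newer
ceph balancer on
# Older-style alternative that lowers the weight of the most-filled OSDs:
ceph osd reweight-by-utilization
```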

Something I’m navigating is that Ceph will go read-only when an OSD approaches the full limit, which is typically 95% of available space. It starts alerting like crazy at 85% full, warning of dire things to come. Notice in my OSD status below that I have massive imbalances between the initial OSDs 0,1,2 and the later OSDs 3,4,5 and 6,7,8.

[Image: Ceph OSD Status]
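For reference, these are the kinds of commands for inspecting, and very carefully raising, those fill ratios. The 85% and 95% figures above are Ceph's defaults; the raised values in the last two lines are purely illustrative, and raising ratios is only a stopgap while data is moved or capacity is added.

```bash
ceph osd dump | grep ratio        # show full / backfillfull / nearfull ratios
ceph health detail               # which OSDs are nearfull or full right now
# Temporary relief only; example values, not a recommendation:
ceph osd set-nearfull-ratio 0.90
ceph osd set-full-ratio 0.97
```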