McGarrah Technical Blog

Posts tagged with "troubleshooting"

Ceph Cluster Complete Removal on Proxmox for the Homelabs

My test Proxmox cluster exists for testing, and along the way I badly broke its Ceph cluster while doing a lot of physical media replacements. The test cluster is the right place to try out risky stuff, rather than my main cluster that is loaded up with my data. Fixing a broken cluster often teaches you something, but in this case I already know the lessons and just want to fast-track getting a clean Ceph cluster back online.

I need it back in place to test the Proxmox 8.2 to Proxmox 8.3 upgrade of my main cluster. So this is a quick guide on how to completely clean out your Ceph cluster installation, as if it never existed, on a Proxmox 8.2 or 8.3 cluster.

proxmox ceph install dialog

Diagnosing a broken microwave

My relatively new microwave just stopped heating things for no apparent reason one morning. We bought it about three years ago, so I was not happy, as I expect these to work for a while; several of ours have lasted ten-plus years. We picked up a new one from the local white box retailer as we wanted a replacement quickly. But my wife, while digging around on YouTube, found “Microwave works but wont heat - Cheap and easy fix”, which was exactly what we experienced.

That video said it was likely a fuse or diode which are both cheap enough that they are worth an attempt at fixing. That will give me an extra microwave for the kids to use upstairs if I can fix it and save some landfill space.

fuse diode
Fuse Diode

HP ProCurve Switch Java WebUI

“Don’t bury the lede”

A working HP ProCurve Java WebUI screenshot to show that I got it working.

ProCurve WebUI

My earlier post HP ProCurve 2800 initial setup discussed an initial configuration of a network switch and mentioned in passing that I got the ProCurve Java WebUI working in a relatively safe manner. Here is how I put that together on a modern machine running Windows 10 Professional 64-bit.

WARNING: It should go without saying, but do not take the Firefox web browser from January 2017 that we are setting up here onto the public internet. It exists only to run the very old Java web app supported on hardware released in 2004 and EOL in 2013. You will be hacked without a doubt in seconds. These are completely unpatched versions of two very, very, very old pieces of software. You have been duly warned.

Powerline Networking for the Homelabs

I inherited, from a stack of old junk hardware, two Netgear Powerline 500 Nano XAVB5101 plugs. I thought I would try them out for a quick network connection between two floors in my new house using the existing power cabling.

Powerline NIC

Wow did I learn a lesson in a combination of networking and electrical power the hard way… with a repeatedly blown breaker.

Proxmox VE 8.1 to 8.2 upgrade issues in the Homelabs

An extended power loss for my primary Proxmox 8 cluster, while I was remote, took half of my cluster nodes out of commission into an unbootable state. This unbootable half of the cluster would not show up on the network after the power came back even with manual physical rebooting. The other half would boot up and show on the network. All the nodes had a second problem that they would not open a PVE WebUI Console Shell or show any output on any of the video output ports for either the Nvidia PCIe GPU or the Intel iGPU. So I have to figure out what looks to be a set of overlapping issues and clean up this mess. There were several lessons learned and re-learned along the way.

First, I need a “crash cart” to recover these to a bootable state. What is a “crash cart”? It is usually a rolling cart found in a data center that you roll up to a broken server. They typically include some sort of serial terminal and/or a monitor, keyboard and mouse, with a lot of connectors and adapters to hook up to whatever ports are on the equipment you are fixing. Mine includes adapters for VGA, DVI, DisplayPort, HDMI and both USB and PS/2 keyboards and mice. I’ve even thrown in a spare known-good Nvidia K600 video card for troubleshooting graphics cards. A trusty and up-to-date Ventoy bootable USB is sitting on there as well. I have a laptop that I could use for a serial terminal if we get to that point, but I was hoping I wouldn’t need it since those are mostly for network equipment.

Crash Cart

Here is my quickly thrown together trash can crash cart (TC3) for this adventure.

ProxMox 8.2.4 Upgrade on Dell Wyse 3040s

My earlier post for ProxMox 8.2.2 Cluster on Dell Wyse 3040s mentioned the tight constraints of the cluster on both RAM and disk space. There are some extra steps involved in keeping a very lean Proxmox 8 cluster running on these extremely resource-limited boxes. I am running Proxmox 8.2 and Ceph Reef on them, which leaves them slightly under-resourced by default. So when Ceph would not start the Ceph Monitors after my upgrade from Proxmox 8.2.2 to 8.2.4, I had to dig a bit to find the problem.

Proxmox SFF Cluster

A Ceph Monitor will not start up if there is not at least 5% free disk space on the root partition. My root volumes were sitting right at 95% used. So our story begins…
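The free-space check described above is easy to script before an upgrade. Here is a minimal sketch in Python using only the standard library; the function names are my own for illustration, the 5% floor is the figure quoted above, and how Ceph treats the exact boundary value is an assumption here.

```python
import shutil

# Assumed floor: the 5% free-space figure quoted in the post.
MIN_FREE_PERCENT = 5.0


def free_percent(path: str = "/") -> float:
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100.0


def monitor_can_start(free_pct: float, floor: float = MIN_FREE_PERCENT) -> bool:
    """True when free space is at or above the floor.

    Treating exactly 5% as acceptable is an assumption made for this sketch.
    """
    return free_pct >= floor


if __name__ == "__main__":
    pct = free_percent("/")
    state = "OK" if monitor_can_start(pct) else "too full for the monitor"
    print(f"root filesystem: {pct:.1f}% free ({state})")
```

A root volume at 95% used sits right on the floor, so even a small package download during an upgrade can tip it over.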

Sharing file systems between WSLv2 instances

I have a significant investment in my WSLv2 Ubuntu 22.04.3 LTS installation. It has my Nvidia GPU setup nicely integrated, plus several machine learning demos and tests I’ve built and use for keeping current on machine learning. With Ubuntu 24.04 LTS released, I now want to play around in the newer version but don’t want to move or, worse, copy my entire set of models and repositories across. I have well over 500 GB of content and absolutely don’t want two copies of it floating around. I’m looking for a solution to this and figure others have encountered it.

Explorer WSL Filesystems

Running Github Pages locally

This post covers how to run GitHub Pages locally in my Microsoft Windows 10 Pro WSLv2 Ubuntu 22.04 LTS environment, using Visual Studio Code to modify the contents. I’m not a Ruby or Jekyll expert by any means; I just wanted a quick guide on running my GitHub Pages website locally to review changes before pushing them to this website. It seemed like an easy enough thing, but there were a couple of hiccups to sort out, so I thought I’d write them down for future me when I try this again.

This should also let me test out new plugins, new versions and changes to templates without breaking the public website. I’m still sorting out how to do the abstracts and formatting of the archive pages correctly.

Ceph Cluster rebalance issue

This is a rough draft that I’m just pushing out, as it might be useful to someone, rather than letting it stay in my drafts folder forever… Good enough beats a Perfect that never ships, every time.

I think I have mentioned my ProxMox/Ceph combo cluster in an earlier post. A quick summary: it consists of a five (5) node cluster for ProxMox HA, and three of those nodes have Ceph with three (3) OSDs each, for a total of nine (9) 5 TB OSDs. They are in a 3/2 Ceph configuration, with three copies of each piece of data, so the cluster keeps running as long as two nodes are active. Those OSDs / hard drives were added in batches of three (3), one per node, as I could get drives cleaned and available. So I added them piecemeal: a set of three OSDs, then three more, and finally the last batch of three. I’m also committing the sin of not using 10Gbps SAN networking for the Ceph cluster and using 1Gbps instead, so performance suffers.
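To put numbers on that layout, here is a small back-of-the-envelope sketch in Python. The node, OSD and size figures come from the summary above; the function names are mine, and I am assuming a replicated pool that places one copy per node, which is the usual reading of a 3/2 setup on three hosts.

```python
# Figures from the post: 3 Ceph nodes, 3 OSDs per node, 5 TB per OSD,
# and a replicated pool with size=3 / min_size=2 (the "3/2" configuration).
NODES = 3
OSDS_PER_NODE = 3
OSD_SIZE_TB = 5
POOL_SIZE = 3      # copies kept of every object
POOL_MIN_SIZE = 2  # copies that must be reachable for I/O to continue


def raw_capacity_tb(nodes: int, osds_per_node: int, osd_size_tb: float) -> float:
    """Total raw disk space across every OSD."""
    return nodes * osds_per_node * osd_size_tb


def usable_capacity_tb(raw_tb: float, pool_size: int) -> float:
    """Raw space divided by the replica count (ignores full-ratio headroom)."""
    return raw_tb / pool_size


def tolerated_node_failures(pool_size: int, pool_min_size: int) -> int:
    """Nodes that can be down while I/O continues, assuming one copy per node."""
    return pool_size - pool_min_size


raw = raw_capacity_tb(NODES, OSDS_PER_NODE, OSD_SIZE_TB)
print(f"raw capacity:    {raw} TB")
print(f"usable capacity: {usable_capacity_tb(raw, POOL_SIZE)} TB")
print(f"node failures tolerated: {tolerated_node_failures(POOL_SIZE, POOL_MIN_SIZE)}")
```

So nine 5 TB drives give 45 TB raw but only about 15 TB of usable space before any fill limits are taken into account.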

Adding them in pieces while I also loaded up the CephFS with media content is what is hurting me now. My first three OSDs, which are spread across the three nodes, are pretty full at 75-85%, and as I added the next batches, the cluster never fully caught up and rebalanced the initial contents. This skews the output of my ‘ceph osd df tree’ command, which shows less space than I actually have available.
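A rough way to see why the imbalance costs space: new data is spread across all OSDs, so writes stop when the fullest OSD hits its limit, no matter how empty the others are. The Python sketch below is a deliberately simplified model, not how Ceph actually computes its numbers. It assumes equal OSD weights, new data landing evenly on every OSD, and a 95% limit. The utilization figures for the later batches are made up for illustration; only the 75-85% range for the first three comes from my cluster.

```python
FULL_RATIO = 0.95   # assumed fill limit
OSD_SIZE_TB = 5.0   # OSD size from the post
REPLICAS = 3        # three copies of each piece of data


def headroom_tb(utilizations, osd_size_tb=OSD_SIZE_TB,
                full_ratio=FULL_RATIO, replicas=REPLICAS):
    """Estimate how much more user data fits before the fullest OSD is full.

    Simplified model: every OSD receives the same share of new data, so the
    fullest OSD decides when writes stop.
    """
    fullest = max(utilizations)
    per_osd_raw_tb = max(0.0, full_ratio - fullest) * osd_size_tb
    return per_osd_raw_tb * len(utilizations) / replicas


# Hypothetical fill levels: first batch nearly full, later batches mostly empty.
unbalanced = [0.85, 0.80, 0.75, 0.40, 0.40, 0.40, 0.20, 0.20, 0.20]
# Same total data, spread perfectly evenly.
balanced = [sum(unbalanced) / len(unbalanced)] * len(unbalanced)

print(f"unbalanced headroom: {headroom_tb(unbalanced):.2f} TB")
print(f"balanced headroom:   {headroom_tb(balanced):.2f} TB")
```

With these made-up figures the unbalanced cluster has roughly 1.5 TB of headroom, while the same data spread evenly would leave roughly 7.25 TB.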

Something that I’m navigating is that Ceph will go into read-only mode when you approach the fill limit, which is typically 95% of the space available. It starts alerting like crazy at 85% full, with warnings of dire things coming. Notice in my OSD status below that I have massive imbalances between the initial OSDs 0,1,2 versus 3,4,5 and 6,7,8.

Ceph OSD Status
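The two thresholds mentioned above (warnings at 85%, writes blocked at 95%) can be expressed as a tiny classifier. This is only an illustrative Python sketch: the state names and the sample utilization values are hypothetical, and your cluster's configured ratios may differ from these typical values.

```python
NEARFULL_RATIO = 0.85  # warnings start here (typical value from the post)
FULL_RATIO = 0.95      # writes stop here (typical value from the post)


def osd_state(utilization: float) -> str:
    """Classify one OSD's fill level against the two thresholds."""
    if utilization >= FULL_RATIO:
        return "full"
    if utilization >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


# Hypothetical utilization per OSD id, for illustration only.
sample = {0: 0.86, 1: 0.80, 2: 0.75, 3: 0.40, 6: 0.20}

for osd_id, used in sorted(sample.items()):
    print(f"osd.{osd_id}: {used:.0%} used -> {osd_state(used)}")
```

One OSD crossing the 85% line is enough to start the warnings, even when the cluster as a whole has plenty of room.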