A few weeks ago one of my primary VM host nodes suffered a failure of the disk the hypervisor was installed on. Normally this wouldn’t be a big deal, as I keep a second server as a cold spare, but I ran into a few problems.

I keep my main two compute nodes in a Proxmox cluster. The way it essentially works is that a tool called corosync syncs the configuration data for VMs and containers to all of the nodes. Each node then has a copy of all this information, so in the event of a node failure you can just move that configuration onto the surviving node. The process is simple but has worked great.
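For context, that synced configuration lives in the cluster filesystem under /etc/pve, and recovering a guest from a dead node is really just moving its config file into a surviving node’s directory. A rough sketch of what that looks like (the node names and VM ID here are made up for illustration):

```sh
# Every node's guest configs are visible under /etc/pve/nodes/<node>/.
# "pve1" is the dead node, "pve2" the survivor, and 100 is a made-up VM ID.

# If the surviving node lost quorum, /etc/pve goes read-only;
# lowering the expected vote count makes it writable again.
pvecm expected 1

# Move the VM's config from the dead node's directory to this node's.
mv /etc/pve/nodes/pve1/qemu-server/100.conf \
   /etc/pve/nodes/pve2/qemu-server/100.conf

# Container configs live under lxc/ instead of qemu-server/.
```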

The only real caveat is that the cold spare is obviously off most of the time, so it has to be booted up every now and then to sync. I don’t really consider this a big problem, as it should be booted up and patched every so often anyway.

Unfortunately, things had been busy and I had generally neglected that cold spare, so when the disk failed on my hot node, I knew I was in trouble. I wasn’t particularly worried about losing data – all of that is stored on a mirrored ZFS array – but I knew I had lost all the metadata about the VMs (think the config file specifying a VM’s memory, disks, CPU, networking, etc.).
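For a sense of what that metadata actually is: each guest’s config is just a small text file in /etc/pve. Something like this (an illustrative example, not my actual config):

```sh
$ cat /etc/pve/qemu-server/100.conf
boot: order=scsi0
cores: 4
memory: 8192
name: docker01
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0
ostype: l26
scsi0: local-zfs:vm-100-disk-0,size=64G
```

Lose that file and the VM’s disks are still sitting on the ZFS pool, but nothing knows how they were supposed to be wired together.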

Mistake Number One: Keep your backups and maintain your recovery strategies.

At this point I shut down the dead hot server, moved all the data disks to the new one, and booted up the new server to see where I was. Looking at the corosync data it was very obvious that it had easily been two months since I last booted this machine: the VM config files were grossly out of date and many of the recent changes were missing.

I did luck out in that some of the more stable stuff was still there (like my mail server, which is super important and really hasn’t changed in a long time), so I was able to move those over and get them back online fairly quickly.

The biggest issue was that I had migrated from a Kubernetes setup to a basic VM/Compose setup and had shut down my GitLab and some other associated instances. So I knew I had to re-create the Docker VM config file. I did have a template I could use (really just copying another VM’s config and adjusting it), so I set about that, as sketched below.
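In practice the rebuild was roughly copy, edit, and re-attach; something along these lines (the VM IDs are hypothetical):

```sh
# Copy an existing VM's config as a starting template (IDs are made up).
# Reusing the lost VM's original ID means its surviving ZFS volumes
# (named vm-110-disk-*) can still be matched back to it.
cp /etc/pve/qemu-server/101.conf /etc/pve/qemu-server/110.conf

# Adjust name, memory, cores, MAC address and disk references by hand.
vi /etc/pve/qemu-server/110.conf

# Scan the storages for disk images belonging to this VM ID and add any
# unreferenced ones to the config as "unused" disks to be re-attached.
qm rescan --vmid 110
```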

While doing that I realized I had a bunch of VM disk datasets that were no longer used for anything and figured it was a good time to clean them up. Unfortunately for me, my Docker VM was brand new and I wasn’t used to seeing it, and since I was already stressed and in a hurry I didn’t realize that I was cleaning up that VM’s disks as well.

Mistake Number Two: Don’t do cleanup during a failure recovery phase.

Thankfully I had at least spent some time the previous week setting up a deployment pipeline to push out changes to this VM, so once I realized my mistake I was able to spin up a new VM and at least get the configuration replaced (mainly Traefik, which routes EVERYTHING). But I had lost my website and accidentally wiped out my Bitwarden database.

Mistake Number Three: Not having backups.

All in all it was messy, but my overall recovery strategy proved that it does work as long as you maintain it, and I discovered a few problems I needed to address before they became a real issue (like not having backups :facepalm:). Losing my website wasn’t the end of the world – it could have been a lot worse.
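For what it’s worth, fixing mistake number three doesn’t take much; Proxmox’s built-in vzdump can back guests up on a schedule, and even a one-liner like this (the VM ID and storage name are made up) would probably have saved the Bitwarden database:

```sh
# Snapshot-mode backup of VM 110 to a dedicated backup storage.
vzdump 110 --storage backup-nas --mode snapshot --compress zstd

# Or schedule it for everything via Datacenter -> Backup in the GUI,
# which ends up running vzdump on a timer anyway.
```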