Self Hosting and Natural Disasters

I’m a huge proponent of self hosting and data ownership. I host my website, my cloud file syncing, my email – everything. All of it runs on some servers in a server rack in my house. This lets me control and own my data without relying on third parties. Sure, it’s more work for me and I have to spend time here and there maintaining it, making sure backups are working and everything is up to date, but I kind of like doing it. Plus, since I use Salt, it’s mostly automated anyway.

There is one big problem I have with self hosting, though: I live in an area that is prone to very destructive hurricanes. That makes hosting things at home very problematic when it’s entirely possible I could lose my email and important documents at exactly the time I need them most. It also means that if my internet goes down, no email – and while I won’t really lose any email, not receiving any is also a problem.

So I had put together a “Hurricane Preparedness” plan for my data, which involved shifting my Nextcloud and email services to a VPS somewhere while I went out and battened down the hatches ahead of the oncoming weather onslaught. Luckily, I had never actually had to enact the plan – we’ve had a few lucky years of avoiding hurricanes in general. But I knew at some point I should run through the process to test it (really, you should also test your backups).

Here’s what I learned from the process:

  1. I had no automation to deploy Nextcloud, and it took a long time to re-create exactly what I needed to deploy it. I’m glad I didn’t have to do it in a rush.
  2. I didn’t even consider having to migrate Bitwarden up to the cloud, but losing that would be a tremendous issue.
  3. Email is insanely critical and losing access to it can be absolutely crippling in an emergency. That said, the mailcow backup/restore process worked great and it was very easy to move (see the sketch after this list).
  4. Raw block storage is really fuckin expensive.
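
For reference, the mailcow move really was just its bundled helper script plus shipping the result to the VPS. Roughly the following, where the backup path and hostname are placeholders for my real ones:

```sh
# On the home server: dump mail, database, and config with mailcow's helper script
cd /opt/mailcow-dockerized
MAILCOW_BACKUP_LOCATION=/backups/mailcow \
  ./helper-scripts/backup_and_restore.sh backup all

# Ship the backup to the VPS (hostname is a placeholder)
rsync -avz /backups/mailcow/ vps.example.com:/backups/mailcow/

# On the VPS: with mailcow cloned and mailcow.conf copied over,
# run the interactive restore and pick the backup set to pull in
cd /opt/mailcow-dockerized
MAILCOW_BACKUP_LOCATION=/backups/mailcow \
  ./helper-scripts/backup_and_restore.sh restore
```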

Ultimately, what I realized during this was that my plan was pretty much a failure, and the last thing I want to be doing while getting ready for a hurricane is worrying about whether my email is working (fucking DNS) or whether I’m about to lose my password manager.

So I’ve decided that I’m going to shift my email and password manager to permanently live on a VPS – this saves me from even having to worry about moving them, and it also stops me from losing these services if I lose power or internet at home. I consider these two things mission critical, and losing access to them for even an hour is an extreme problem.

I’m still hosting my Nextcloud locally because it is very expensive to run in the cloud on raw block storage. Also, given my local setup – a RAID disk array with replicated ZFS snapshots – it does feel very safe. And I have spent time on the local-to-cloud migration to the point where it’s entirely automated now: one single Salt formula runs the entire process, and in a few hours it will be up and running in the cloud. It’s not perfect, and not as nice as not having to worry about it at all, but given the cost I think this one component is a fair tradeoff.
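
For the curious, none of the moving parts are exotic. The local safety net is plain `zfs send`/`receive` between boxes, and the migration itself is a single Salt run against the VPS – the dataset, host, and formula names below are made-up stand-ins for mine:

```sh
# Snapshot the Nextcloud dataset and replicate it to the second box
# (pool, dataset, and host names are placeholders)
zfs snapshot tank/nextcloud@migrate
zfs send tank/nextcloud@migrate | ssh backup-host zfs receive -F backup/nextcloud

# Kick off the cloud migration: one Salt state run targeting the VPS
# ("nextcloud.cloud_migrate" stands in for my actual formula)
salt 'nextcloud-vps' state.apply nextcloud.cloud_migrate
```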

I’m not sure people always consider the fact that a natural disaster could take out their home-hosted data. Even when I thought I had a plan, it ultimately wasn’t a very good one once I put it into practice.

Slow Down During Failures

A few weeks ago, one of my primary VM host nodes experienced a disk failure on the disk the hypervisor was installed on. This normally wouldn’t be that big of a deal, as I keep a second server as a cold spare, but I ran into a few problems.

I keep my main two compute nodes in a Proxmox cluster. The way it works is by using a tool called corosync to sync the configuration data of VMs and containers to all of the nodes. Each node then has a copy of all this information so that, in the event of a node failure, you can just move that configuration onto the new node. This process is fairly simple and has worked great.
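
Concretely, those VM definitions are just small text files in the cluster filesystem under /etc/pve, so recovering a VM onto another node mostly means moving its config file into that node’s directory. A rough sketch with made-up node names and VM ID (and note that a two-node cluster that just lost a member needs its quorum expectation lowered before /etc/pve becomes writable again):

```sh
# On the surviving node: the cluster filesystem still has the dead node's VM configs
ls /etc/pve/nodes/deadnode/qemu-server/

# With one of two nodes gone, tell the cluster to expect a single vote
# so this node regains write access to /etc/pve
pvecm expected 1

# "Move" VM 101 to the surviving node by relocating its config file
mv /etc/pve/nodes/deadnode/qemu-server/101.conf \
   /etc/pve/nodes/livenode/qemu-server/101.conf
```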

The only real caveat of doing this is that the cold spare is obviously off most of the time, so it does require a boot up to sync every now and then. I don’t really consider this a big problem as it should be booted up and patched every so often anyway.

Unfortunately, things have been busy and I had generally neglected that cold spare, so when the disk failed on my hot node, I knew I was in trouble. I wasn’t exactly concerned about losing any data – all of that is stored on a mirrored ZFS array. But I knew I had lost all the metadata about the VMs (think the config file specifying the details about the VM’s memory, disk, CPU, networking, etc.).

Mistake Number One: Keep your backups and maintain your recovery strategies.

At this point I shut down the dead hot server, moved all the data disks to the new one, and booted it up to see where I was. Looking at the corosync data, it was very obvious that it had easily been two months since I had last booted this machine – the VM config files were grossly out of date and many of the changes were missing.

I did luck out in that some of the more consistent stuff was still there (like my mail server, which is super important and really hasn’t changed in a long time), so I was able to move those over and get them back online fairly quickly.

The two biggest issues were that I had migrated from a Kubernetes setup to a basic VM/Compose setup and had shut down my Gitlab and some other associated instances. So I knew I had to re-create the Docker VM config file. I did have a template I could use (really just copying another VM’s config and adjusting it), so I set about that.
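
In practice the “template” was just another VM’s config file copied and edited, then pointed back at the disk that survived on the ZFS array. Something along these lines, with made-up VM IDs and storage names:

```sh
# Copy an existing VM's config as a starting point (IDs are illustrative)
cp /etc/pve/qemu-server/101.conf /etc/pve/qemu-server/120.conf

# Edit the name, MAC address, memory, cores, etc. to match the Docker VM
vi /etc/pve/qemu-server/120.conf

# Attach the surviving ZFS-backed disk instead of the template's disk
qm set 120 --scsi0 local-zfs:vm-120-disk-0

# Sanity-check the result, then boot it
qm config 120
qm start 120
```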

While doing that, I realized I had a bunch of VM disk datasets that were no longer used for anything and figured it was a good time to clean them up. Unfortunately for me, my Docker VM was brand new and I wasn’t used to seeing it, and since I was already stressed and in a hurry, I didn’t realize that I was cleaning that VM up as well.

Mistake Number Two: Don’t do cleanup during a failure recovery phase.

Thankfully, I had at least spent some time the previous week setting up a deployment pipeline to push out changes to this VM, so once I realized my mistake I was able to spin up a new VM and at least get the configuration stuff replaced (mainly Traefik, which routes EVERYTHING). But I had lost my website and accidentally wiped out my Bitwarden database.

Mistake Number Three: Not having backups.

All in all it was messy, but my overall recovery strategy proved that it does work as long as you maintain it, and I discovered a few problems I needed to address before they became a real issue (like not having backups :facepalm:). So losing my website wasn’t the end of the world – it could have been a lot worse.