My server boots off a pair of SSDs, tied together in a ZFS mirror. Proxmox supports ZFS boot (thanks to its Ubuntu kernel), and lets you configure it through the installer. When I installed my server, this is what I wanted, as it meant the OS and data disks for the contained LXCs and VMs were all nicely backed up and versioned.

However, one afternoon I logged in to my server and was met with the worst - some of the drives were failing.

At first, it was just a handful of IO errors on a single drive. These SSDs have been running non-stop for a few years now, so I wasn't surprised they were having issues. However, the Total Bytes Written (TBW) was still only half what the drives are rated for. Knowing the drive was slowly failing, but that the second was still working just fine, I started looking for replacement drives, and a plan for swapping one in (it's more complex than you'd think, but I'll get to that).

But, it's never really that simple. The next day, I logged in to the machine to check on things, and was met by this:

NAME                                       STATE     READ WRITE CKSUM
rpool                                      DEGRADED     0     0     0
  mirror-0                                 DEGRADED    26     0     0
    ata-CT500MX500SSD1_XXXXXXXXXXXX-part3  DEGRADED    28     0     2  too many errors
    ata-CT500MX500SSD1_YYYYYYYYYYYY-part3  FAULTED     17     0     1  too many errors

One drive was so bad that ZFS removed it from the pool, and the other had started exhibiting similar IO-related issues. Oddly enough, the "FAULTED" drive was the one that had been working perfectly the day before.

#Short-term solution

With 2 drives showing signs of failure, I needed a solution pretty quickly. I could order new drives and wait for them to arrive, but given how quickly the pool had taken a turn for the worse, who knew if I had that long. I've had issues in the past with SATA cables causing IO issues, so I tried swapping those out first, but no dice.

Instead, I rummaged around my random drive drawer (come on, we all have one), and picked out the only 2 compatible drives I had: one 750GB SSD, and one 500GB HDD. Both very used, but they were the only drives available. I'd only removed the 750GB SSD from my old laptop the night before, so the timing ended up working out pretty well.

I'm well aware that a mixed-media pool is a terrible idea, especially with a mirror, but those were the only options I had.

At first, I just resilvered the 750GB SSD in to replace the faulted drive, cleared the errors, and ran a scrub. There were quite a few checksum errors, but the pool remained otherwise stable. To avoid a mixed-media pool, and keep the aged HDD far away from my critical data, I left it at that for a day so I could interrogate the failed SSD.

As it turned out, not only had the drive faulted, but the entire partition table had disappeared. I don't know how, or why, but that drive clearly wasn't happy - so it had to go.

#Proxmox ZFS-on-root replacement

My Proxmox installation is on a pair of 500GB SSDs, running ZFS. These disks house the Proxmox OS alongside all the OS drives for the various VMs and LXCs I run. When I built the system, I wanted that storage redundant, so I could replace a single drive easily without downtime.

A few years in, however, I made an interesting observation. The drives contain more than just a single ZFS partition - there are also boot partitions. This makes perfect sense, as the UEFI needs to know how to boot from the drives. But when doing a simple zpool replace, how does the boot information get updated? At the time, I put it down to "there must be a way", and left it at that. Until today, when the need surfaced. So, I looked it up.

Proxmox conveniently has both tooling and documentation to help manage a ZFS-based boot pool, and replacing a disk is surprisingly simple. proxmox-boot-tool is a small command-line tool that ships with Proxmox to format and manage these boot partitions, ensuring they're up-to-date and configured correctly. Armed with some documentation, let's replace a boot drive:

  1. Connect the new drive
  2. Copy the partition table from a working drive to the new drive: sudo sgdisk /dev/disk/by-id/<existing-disk> -R /dev/disk/by-id/<new-disk>
  3. Randomize the disk and partition GUIDs, so the OS can tell the drives apart: sudo sgdisk -G /dev/disk/by-id/<new-disk> (both sgdisk commands are shown together below)
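
Putting steps 2 and 3 together, the sgdisk invocations look something like this (the disk names are placeholders - substitute the correct entries from /dev/disk/by-id):

Bash Session
$ sudo sgdisk /dev/disk/by-id/<existing-disk> -R /dev/disk/by-id/<new-disk>
$ sudo sgdisk -G /dev/disk/by-id/<new-disk>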

At this point, we have 3 partitions on the new drive: a small blank one, the boot partition, and the ZFS partition. They exist, and they're the correct size and type, but they don't contain any data (or rather, whatever they do contain is leftover garbage from the drive's previous life). So, we need to populate them.

For the ZFS partition, this is simple, and it's the step I started first as it takes the longest. Now that there's a ZFS partition to target, a simple zpool replace is possible. Replace the old partition with the new one (make sure to select partition 3 on the new drive, else you'll have to start again), and ZFS will start copying the data over in the background. watch zpool status will show the progress, but it can take a while, especially with degraded drives.
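
As a rough sketch, using the faulted device name from the status output above and the same <new-disk> placeholder:

Bash Session
$ zpool replace rpool ata-CT500MX500SSD1_YYYYYYYYYYYY-part3 /dev/disk/by-id/<new-disk>-part3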

For the boot partition, this is where we need proxmox-boot-tool. The first step is to format the partition. Sure, we could format it by hand with standard tools, but Proxmox has us covered there too:

Bash Session
$ proxmox-boot-tool format /dev/disk/by-id/<new-disk>-part2

Once formatted, we can install a bootloader onto it. My installation uses systemd-boot, but if you have an older grub-based install, add grub to the end of the init command:

Bash Session
$ proxmox-boot-tool init /dev/disk/by-id/<new-disk>-part2

And that's it. Proxmox copies over the files it needs to, and updates its internal list of boot partitions to include the new drive. On future kernel updates, the new partition will be kept in sync too.

If you run this, you'll likely see some warnings (even once it completes successfully). Proxmox keeps a list of all the drives it knows have boot partitions. When our faulted drive disappeared (which had already effectively happened, thanks to the partition table snafu), Proxmox couldn't find its boot partition, and got very unhappy. This is expected, but we need to tell Proxmox to update its registry, which is easily done with proxmox-boot-tool clean. I'd recommend confirming Proxmox can see the 2 drives it expects with proxmox-boot-tool status before cleaning, so you're not left with only one drive holding a boot partition - if that drive dies, things get far more complex!
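
In practice, that's just two commands - check, then clean:

Bash Session
$ proxmox-boot-tool status
$ proxmox-boot-tool clean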

<note>

This process worked fine for me on Proxmox 8. If you're reading this far into the future, I'd recommend looking at Proxmox's documentation to confirm it's still the correct course of action.

</note>

#Reviewing the damage

With the pool available and stable (enough for a day or so, at least), I could assess the damage.

#The dead drive

To have a bit of a look at what was going on, I connected the dead drive to my desktop to interrogate it. SMART immediately reported 5 pending sectors, so the drive was probably on its way out anyway, but dropping the partition table was very confusing.
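
If you want to check this yourself, smartctl (from smartmontools) will report the pending sector count - the device path here is just an example:

Bash Session
$ sudo smartctl -a /dev/sdX | grep -i pending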

Unfortunately, I know very little about drives, partition tables, or how they're mapped to bits on a disk, so I'm not going to forensically examine the drive. But I did want to know quite how dead the drive was. So I ran a single pass of badblocks over it, to see how it performed and how much of the drive was damaged. I knew this would completely wipe the drive, but given I intend to RMA them anyway, and the pool was back mostly stable, there was no real risk.
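
For reference, a destructive write-mode pass looks something like this (it wipes the drive - again, the device path is just an example):

Bash Session
$ sudo badblocks -wsv -t random -b 4096 /dev/sdX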

Confusingly, badblocks reported the drive was absolutely fine. All data was written to and read from the drive correctly. It did slow down to a crawl towards the end, but it still wrote the data fine. I don't trust the drive enough to put it back in the pool and pray, but if I need a scratch SSD for some temporary data, at least there's something around.

#The data

When I first saw the faulted drive, I was filled with a feeling of dread. However, whilst the original drive was degraded, it was clearly still running, and there was no data loss.

Once I'd resilvered the temporary SSD in, however, I ran a scrub, and the damage could be seen. Unfortunately, this is where the troubles began: zpool status reported a number of corrupt files - the worst thing anyone storing data can see.

Nothing prepares you for seeing this, but given the server was still (seemingly?) running fine, perhaps the data loss wasn't too critical. Through something which can only be described as pot luck, most of the corruption was isolated to snapshots, and quite old ones at that. Of the affected files in the live data, most were package manager caches and other files I don't really care about.

<tip>

When running zpool status with -v, ZFS helpfully shows exactly which files are lost, complete with the containing snapshot.

</tip>
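
For my pool, that was simply:

Bash Session
$ zpool status -v rpool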

#LXCs

Because I use LXCs for most of my applications, their root filesystems live directly on ZFS datasets rather than inside opaque disk images, so ZFS can see exactly which files are corrupt.

Fixing the corruption for the package manager cache is simple - purge the cache. It's just a cache, and trying to restore it just wasn't worth it. I could have left it and waited for the next time I update, but I'd much rather not see the warnings. The remaining files were localised to shared libraries or other OS files which I hadn't customised, or which didn't contain any data. To solve this, I worked out which package installed the file in the first place, and force reinstalled it.

<tip>

On Arch systems, this is pacman -Qo <file> to find the owning package, and pacman -S <package> to reinstall it. Obvious, right?

</tip>
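
As a sketch (the library path and package name here are made up):

Bash Session
$ pacman -Qo /usr/lib/libexample.so
/usr/lib/libexample.so is owned by example-package 1.2.3-1
$ pacman -S example-package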

#VMs

Fixing VM images, however, was far more complex. ZFS only sees each disk volume as opaque data, so it's not possible to tell which files are affected. Worse still, there's no guarantee the corruption affects just a single file - it could be data critical to the entire partition layout.

In my case, it was just 3 fairly unimportant VMs. 2 only existed for testing something a while ago, so I just deleted them. The final one held no data itself, so I restored it from a few days before and re-ran my Ansible playbook to make sure it was configured correctly.

#Checksum issues

When a drive starts failing, it'll likely fail in one of 2 main ways: either it'll fail to read or write data at all, or it'll hand back incorrect data (well, most of the time it's both). Most filesystems can at least detect IO errors, but because ZFS keeps checksums of what the content is supposed to look like, it can detect incorrect data too, and in many cases automatically correct it.

Around 2 years ago, I had an issue with some flaky SATA cables, which resulted in a fair few IO and checksum-related errors. I originally blamed it on a bad drive, and requested an RMA. After Seagate support suggested I try another cable, all the issues went away and they've not come back. The moral of that story? Don't buy cheap cables on eBay, and make sure your bend radius isn't too sharp.

When this sort of thing happens, it's not useful staring at a pile of checksum errors from a previous setup - you want to reset ZFS's counters so you can keep monitoring. To achieve this, run zpool clear <pool> to reset the stats on the pool.
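
For my pool, that's:

Bash Session
$ zpool clear rpool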

These stats are tied to the devices in the pool, so replacing a drive already resets its counters. However, even after replacing the drives, I was still seeing a fairly large number of checksum issues, especially after doing a zpool scrub. The temporary drives I used are old, and it's not infeasible that one of them was on its way out too, but that seemed rather unlikely to me.

It's at that point I decided enough was enough and looked towards a longer-term solution, but I was still confused. There were no IO errors of any kind, just checksum issues. And then I realised why it was happening: I had broken files - of course there were going to be checksum issues!

When ZFS reads a block from disk, it hashes the content and compares it against a stored checksum to confirm it looks correct. If it matches, everything is good. If it doesn't, then something has gone wrong, and ZFS increments the CKSUM field in zpool status so the user knows it's happening and can do something about it. Crucially, the checksums are stored completely separately from the data blocks themselves, and sometimes on separate drives (not relevant in my mirrored setup). The checksums exist to detect issues with the data blocks, which is exactly what I had. Whenever something attempted to read those blocks, the checksums didn't match, and ZFS reported an error. It doesn't matter that ZFS already knows the data isn't what it expects; it still flags that there's a checksum issue - because there is. With that out of the way, the panic of seeing the checksum count keep rising subsided, but I still had work to do.

#Long-term solution

There was damage in my pool, and a small amount of unimportant data loss - something had to be done. A week later than I probably should have, I ordered some new drives. The Crucial drives I had before had worked fairly well for me, and were designed for a fairly intensive workload, but I stumbled across some SanDisk drives with a higher endurance rating for the same cost (slightly cheaper, even). So, I hit buy.

<aside>

Amazon offered them with same-day delivery, which still amazes me a little. Naturally, it didn't work, and the drives didn't show up until a few hours after I left the house for an event on Saturday, but it's still a cool dream nonetheless.

</aside>

I'm not a huge believer in the need to torture your drives before you start using them, especially SSDs. A conventional workload isn't anywhere near that intensive, even for the VM boot drives, but I still wanted to do something to confirm I hadn't ordered 2 dud drives. If I resilvered them in and both new drives then died, I'd be screwed. I opted to do a single round of badblocks, confirming every sector could be written to and read from. The drives performed well, and all the data came back clean, so I started the resilver.

With new, stable drives installed, there was no (or at least as close to no as it's reasonable to get) further risk of data loss. The corrupt snapshots were still there, but the safety of my pool was restored. Eventually, after a month or so, sanoid would rotate out the old, now-corrupt snapshots, leaving me with a perfectly healthy pool and no remaining data issues.

#Reflection

In hindsight, I should have listened to SMART. I'm not made of money, so buying drives as soon as there's even a blip in the SMART reports isn't realistic, nor necessary. Replacing a working drive with a working drive isn't going to achieve much. RAID is more about uptime than it is about backups, and it means a dead drive doesn't completely take down a system.

However, I should have started looking around for new drives and had them on hand, so that when a drive started failing as far as ZFS was concerned, I could start replacing it straight away. In my experience, if SMART reports an error, the drive can still go on living a while longer. Once ZFS starts reporting errors, especially if you've already replaced the cables, it probably means the drive is going to die fairly soon - so it's time to replace it.

There are only so many times someone can tell you about the correlation between SMART failures and drives dying. Are they a sign the drive is cursed and going to fail immediately? Probably not. Are they an early warning that the drive is misbehaving? Yes. Should they be ignored? Not at all.
