Minecraft server post mortem

Post by Zancarius » Tue Jun 07, 2016 10:42 pm

As you no doubt saw from this thread, the drive on my file server at home started exhibiting surface faults (bad blocks) toward the end of the disk, where I had created a new container for the Minecraft instance to live. The faults were not immediately apparent when the data was first copied, but the size of the SMART-reported reallocated sector count leads me to believe that the faults were numerous enough to overwhelm the drive's own protection mechanisms. Sadly, much of the Minecraft world was destroyed in the process, and the only backup I had available was around 4-6 weeks old (shame on me). Fortunately, Minecraft was played so rarely that it wasn't much of a concern.

However, this leads us to a few issues to consider. First, if the drive reports a block as functional and that block later fails, or fails in a way that leaves the data corrupted (either silently or to an extent we don't notice until later failures make it more obvious), most existing file systems will do nothing to indicate the failure until the drive itself reports the block as failed (remapping it in hardware or software). Second, incremental backups almost always require third-party software if they're not supported by the file system, and while using such software is usually a good idea, it's handy to have as many tools at your disposal for such a purpose as possible.

Here's what I'm doing to prevent this from happening in the future:

First, I've migrated my home server to a new drive and to a new file system. The new drive is a stopgap until I can get a bit more hardware set up, either to replicate the data or to set up a software RAID for extra redundancy (RAID is not a backup--but it can buy you enough time to diagnose the problem until the data can be restored from backup). I'm using btrfs at the moment, which provides a few advantages (as well as disadvantages; more on that later) over ext4, which I was previously using on the failed disk. btrfs is relatively new and still technically experimental; it's not as good as ZFS, which has more than a decade of development behind it, nor is it as fast as XFS. However, btrfs does provide some features that were once only available on ZFS, like data checksum support, subvolumes, and some degree of self-healing capability (when a redundant copy of the data exists). The advantage is that even if silent corruption of the data occurs, the file system will report on access that the checksums are invalid and the data is suspect.
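To make the checksum idea concrete, here's a minimal Python sketch of the principle: record a checksum per block when the data is written, then recompute on read and flag any block that no longer matches. This is only an illustration, not how btrfs is implemented. The 4096-byte block size, the function names, and the use of `zlib.crc32` (standing in for the CRC32C that btrfs uses by default) are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC32 value per fixed-size block of data."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(
    data: bytes, expected: list[int], block_size: int = BLOCK_SIZE
) -> list[int]:
    """Return the indexes of blocks whose checksum no longer matches."""
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# "Write" a 16 KiB file (four blocks) and record its checksums.
original = bytes(range(256)) * 64
stored_checksums = checksum_blocks(original)

# Simulate silent corruption: one bit flips inside the second block,
# and the drive never reports an error.
damaged = bytearray(original)
damaged[5000] ^= 0x01

corrupt = find_corrupt_blocks(bytes(damaged), stored_checksums)
print("suspect blocks:", corrupt)
```

A file system without data checksums would hand back the damaged block as if nothing had happened; with checksums, the mismatch is caught the moment the block is read.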

Second, I'll either use btrfs' subvolume snapshot feature for incremental backups, which would let me create block-level incremental backups, or continue using my old rsync scripts. I'd rather do the former, but I'd like to test it first to confirm that incremental snapshots are relatively easy to recover given the hardware and software I have available. Block-level snapshots are also smaller than what you can do with rsync, which tends to copy entire files if they've changed rather than only the data that was actually written to them. While I could use binary differencing software, it's another link in the chain, and its output tends to be difficult to apply in reverse to rebuild an image of the data at any point in time. My limited experience with standalone binary diffing packages has been far less than stellar, even though the algorithms have been used successfully in a number of different things over the years.
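The space saving comes from storing only the parts of a file that changed instead of a whole new copy. Here's a rough Python sketch of that difference, assuming fixed 4096-byte blocks and hypothetical helper names (`block_delta`, `apply_delta`). It is not how btrfs works internally; btrfs tracks changes through copy-on-write extents, and incremental transfers go through `btrfs send` with a parent snapshot. The sketch only shows why a small edit to a large file costs so little when changes are tracked per block.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_delta(
    old: bytes, new: bytes, block_size: int = BLOCK_SIZE
) -> dict[int, bytes]:
    """Map block index -> new contents, for every block that differs from old."""
    delta = {}
    for offset in range(0, len(new), block_size):
        new_block = new[offset:offset + block_size]
        if new_block != old[offset:offset + block_size]:
            delta[offset // block_size] = new_block
    return delta


def apply_delta(
    old: bytes, delta: dict[int, bytes], new_length: int,
    block_size: int = BLOCK_SIZE,
) -> bytes:
    """Rebuild the new version from the old version plus the changed blocks."""
    blocks = []
    for offset in range(0, new_length, block_size):
        index = offset // block_size
        blocks.append(delta.get(index, old[offset:offset + block_size]))
    return b"".join(blocks)[:new_length]


# A 1 MiB "region file" in which only four bytes change.
old_version = bytes(1024 * 1024)
new_version = bytearray(old_version)
new_version[10_000:10_004] = b"MCWD"
new_version = bytes(new_version)

delta = block_delta(old_version, new_version)
delta_bytes = sum(len(block) for block in delta.values())

print("whole-file copy stores:", len(new_version), "bytes")
print("block delta stores:    ", delta_bytes, "bytes")

restored = apply_delta(old_version, delta, len(new_version))
```

Restoring is just the old version plus the recorded blocks, which is the property I want to verify with real snapshots before trusting them over rsync.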

Finally, I'm planning on replacing the old server tower this hardware is sitting in with something that actually fits on my equipment rack. I suspect that bumps from cleaning or other disturbances near the box create oscillations, owing to its height, that may have propagated into the hard drive and contributed to the damage we've seen.

As far as the Minecraft world goes, everything should be functional now. I chose to restore from an earlier backup, because recovery of large swaths of the world was somewhat impractical. Unfortunately, this may have reverted some of your work. Matt started building on a small island that was hit hardest, so I've restored his inventory and created a chest where he was last logged in that contains some items to make up for the losses he may have incurred in the process.

Sorry about that, guys. While I regularly back up the critical data on my server, game-related stuff often slips through the cracks.
