Sunday, September 26, 2010

The Sagemath Cluster's Fileserver

In addition to programming in Sage, I am also the systems administrator for the "Sagemath cluster" of computers, which are used for Sage development and also support mathematics research. The cluster consists of 5 machines, each with 128GB of RAM and 24 cores, along with a 24-terabyte Sun X4540 fileserver.

When we set up this cluster nearly 2 years ago, we installed OpenSolaris and ZFS on the fileserver. It has run very solidly, in that the uptime was over 600 days a few days ago. However, no software would really work well on OpenSolaris for me -- top usually crashed, emacs crashed, stuff just didn't work. I never got Sage to build. USB disks were also a major pain on this machine, which complicated backups, and I frankly found the performance of ZFS and NFS-from-ZFS disappointing. In addition, nobody had a clue how to do maintenance and security updates on this server, so it was probably a danger.

Then Sun got bought by Oracle, and Oracle killed OpenSolaris. Also, I started getting really into MongoDB for database work related to the modular forms database project, and it would be nice to be able to run a MongoDB server and other related software directly on my fileserver (yes, MongoDB supports Solaris, but...). Getting anything to work on Solaris costs too much time and confusion. So I decided to delete OpenSolaris and install Linux. Since there are many terabytes of data on the fileserver, and dozens of people using it, this involved many rsyncs back and forth.


I eventually succeeded in installing Ubuntu 10.04.1 LTS Server on disk.math. I set up a big 20TB software RAID-5 array on disk.math (with 2 spares) and added it to an LVM2 (= logical volume management) volume group.
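For anyone curious, the setup boils down to something like the following. This is a scaled-down sketch, not the exact commands I ran: the device names, the number of disks, and the volume group name "data" are all illustrative.

    # create the RAID-5 array with 2 hot spares (the real array spans far more disks)
    mdadm --create /dev/md0 --level=5 --raid-devices=4 --spare-devices=2 \
          /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
    # turn the finished array into an LVM2 physical volume and add it to a volume group
    pvcreate /dev/md0
    vgcreate data /dev/md0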

I created 3 partitions:

        home -- 3.5 terabytes: for people's home directories
        scratch -- 3.5 terabytes: for scratch space that is available on all machines in the cluster
        lmfdb -- 3.5 terabytes: for the L-functions and modular forms database project.

Thus over 7.5 terabytes is not allocated at all right now.  This could be added to the lmfdb partition as needed.
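Carving out the volumes, and growing one later, is just a few commands. Again a sketch: the volume group name "data", the ext4 filesystem, and the +2T growth figure are assumptions for illustration.

    # create the three logical volumes
    lvcreate -L 3.5T -n home    data
    lvcreate -L 3.5T -n scratch data
    lvcreate -L 3.5T -n lmfdb   data
    mkfs.ext4 /dev/data/home          # likewise for scratch and lmfdb
    # later, grow lmfdb into unallocated space and resize its filesystem
    lvextend -L +2T /dev/data/lmfdb
    resize2fs /dev/data/lmfdb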

The RAID-5 array has impressive (to me) performance:

   root@disk:/dev# /sbin/hdparm -t /dev/md0
     /dev/md0:
     Timing buffered disk reads:  1638 MB in  3.00 seconds = 545.88 MB/sec

All of the 3.5TB partitions above are NFS-exported from disk.math, and will be mounted on all the other machines in the cluster.
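The NFS side is a few lines in /etc/exports on disk.math plus a mount (or fstab entry) on each node. The export paths and the network range here are illustrative, not the real configuration:

    # /etc/exports on disk.math
    /export/home     10.0.0.0/24(rw,sync,no_subtree_check)
    /export/scratch  10.0.0.0/24(rw,sync,no_subtree_check)
    /export/lmfdb    10.0.0.0/24(rw,sync,no_subtree_check)
    # then re-export with: exportfs -ra

    # on each compute node, e.g. a line in /etc/fstab
    disk.math:/export/home   /home   nfs   rw,hard,intr   0 0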

By using LVM, we will still get snapshotting (like ZFS has), which is important for robust backups over rsync, and is great for users (like me!) who accidentally delete important files.
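A snapshot is just another logical volume, so something along these lines gives a frozen, consistent copy to back up from (the names and the 100G copy-on-write reserve are illustrative):

    # create and mount a read-only snapshot of home
    lvcreate -s -L 100G -n home-snap /dev/data/home
    mkdir -p /mnt/home-snap
    mount -o ro /dev/data/home-snap /mnt/home-snap
    # ... back it up ...
    umount /mnt/home-snap
    lvremove -f /dev/data/home-snap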

I chose 3.5TB for the partitions above, since that size is easy to back up using 4TB external USB disks. Now that I'm running Linux on disk.math, it is easy to just plug a disk into the fileserver, make a complete backup, then unplug it and swap in another backup disk, and so on.
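The backup itself is then an rsync from a snapshot onto the USB disk. This sketch assumes the snapshot from the example above is still mounted; the USB device name and mount point are illustrative:

    # mount the external USB disk and copy the snapshot onto it
    mkdir -p /mnt/usb-backup
    mount /dev/sdz1 /mnt/usb-backup
    rsync -aH --delete /mnt/home-snap/ /mnt/usb-backup/home/
    umount /mnt/usb-backup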

----

Some general sysadmin comments, in case people are interested.

   (1) Oracle shut down access to firmware upgrades, which meant I couldn't upgrade the firmware on disk.math.  Maybe it would have been possible via phone calls, etc., but I was so pissed off to see Oracle do something that lame.  It's just evil not to give away firmware upgrades for hardware.  Give me a break.  Oracle sucks.  I'm really glad I just happened to upgrade the firmware on all my other Sun boxes recently.

   (2) The remote console -- which is needed to install from a CD image on disk.math -- does not work correctly on 64-bit OS X or 64-bit Linux.    It just fails with "not supported on this platform" errors, but with no hint as to why.   By installing a 32-bit Linux into a virtual machine, I was able to get this to work.

  (3) Linux NFS + software RAID + Linux LVM2 (= logical volume management) does roughly "the same thing" as ZFS did on the old disk.math machine.  I spent about equal time with the documentation for both systems, RAID+LVM2+NFS and ZFS; the documentation for RAID+LVM2+NFS is definitely less professional, but at the same time vastly more helpful.  There are lots of good step-by-step guides, forums, and just generally a *ton* of useful help out there.  With the ZFS documentation, there is basically one canonical document, which, though professional, is really frustrating as soon as you have a question it doesn't answer.  I personally greatly prefer the Linux ecosystem.

  (4) RAID+LVM2+NFS is much more modular than ZFS.   Each part of the system can be used by itself without the other parts, each has its own documentation, and each can make use of the other in multiple flexible and surprising ways.    It's very impressive.    One of the reasons I chose ZFS+Solaris instead of Linux+whatever 2 years ago is that I didn't realize that Linux offers similar capabilities.... because it didn't back in 2001 when I was learning tons about Linux.  But it does now.  Linux on the server has really, really improved over the years.   (I've been using Linux since 1993.)