Software and hardware annotations 2008 February
This document contains only my personal opinions and calls of
judgement, and where any comment is made as to the quality of
anybody's work, the comment is an opinion, in my judgement.
- 080226 Tue
Free software and the value of proprietary platforms
- Today I was discussing software usage patterns with a bright,
sensible scientist, and he told me that a few years ago he was using
GNU/Linux because he could get excellent free software on it, and
only on it. But now he is using MS-Windows because all that
excellent free software has been ported to MS-Windows, and he also
has the option of using all the MS-Windows packages on the same
computer without switching systems.
In other words he was using the argument that the value of
a platform is in the value of the software available for the
platform, and that a platform like MS-Windows on which both
proprietary and free software packages are easily available is
more valuable than one where only free software is available.
For advocates of free software like
Richard Stallman, porting free software to
proprietary platforms is indeed a bad idea, because it just adds
value to the proprietary platform, reducing the viability of the
whole free software system. There are two ways to add value to a
competing platform, and it does not matter whether either platform
is free software or proprietary:
- Make valuable software that was previously available only on one
platform available on the competing platform too. This is why, for
example, game console suppliers are so keen on game title
exclusives for their console platforms.
- Make one's platform compatible with the competing platform
(a more direct variant of the portability argument). Developers
will then develop for the competing platform only, because that
opens access to the user bases of both platforms.
However useful in the short term, projects like Gimp for MS-Windows
or Apache for MS-Windows add great value to MS-Windows as a runtime
platform, and compatibility layers like Cedega add a lot of value to
MS-Windows as a (games) development platform. Cedega for example is
likely to have greatly reduced the incentive for game developers to
commission a GNU/Linux port of their products. In the same way OS/2
suffered greatly from its very good runtime compatibility with
MS-Windows, as that removed most of the reason to develop OS/2
specific applications.
What should the supporters of a platform (proprietary or
free) do then? Well, avoid supporting competing
platforms by extending to them runtime or development
compatibility. It is not by chance that Microsoft have been rather
unenthusiastic about supporting the JVM on their platforms, and have
made their browser rather less compatible with other browsers than
it could have been. Had they done otherwise they would have enhanced
the value of the JVM as a runtime platform and the value of
standards-based browsers as development platforms, and they own
neither.
Sure, providing runtime or development compatibility for a platform
that has few native applications may provide a short-term increase
in its popularity, but ultimately destroys it, because the only
recipe for platform success is the hard one: to build an ecosystem
of native applications and of developers that have invested in the
platform.
- 080220 Wed
A 48 drive setup, it had to happen
- I have often been wondering how some people are setting up all the
48-drive Thumper boxes that Sun has been selling in large numbers,
and here is one of the first sightings:
The box presents 48 drives, split across 6 SATA controllers.
So disks sda-sdh are on one controller, etc. In our
configuration, I run a RAID5 MD array for each controller,
then run LVM on top of these to form one large VolGroup. I
found that it was easiest to setup ext3 with a max of 2TB
partitions. So running on top of the massive LVM VolGroup are
a handful of ext3 partitions, each mounted in the filesystem.
but we're rewriting some software to utilize the [ ... ]
In this setup the only saving grace is the splitting of the total
capacity into 2TB slices, which is wise also because of file system
check times. My reckoning is that the rest of the setup is
remarkably inappropriate (a rough reconstruction of the commands is
sketched after this list):
- Parity RAID sets that are far too wide (RAID5 across the 8 drives
on each controller).
- It is likely that the volume group is a linear concatenation of
the 6 underlying parity RAID sets. This is both the least reliable
and the lowest performing combination.
- Assuming the standard Thumper 500GB disks, each RAID set has a
capacity of 3.5TB. This means that several of the 2TB filesystem
volumes will straddle a RAID set boundary, thus leading to total
loss of that filesystem if either underlying set fails.
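To make the criticism concrete, here is a rough sketch of the mdadm
and LVM commands implied by the quoted description (a reconstruction
under assumptions: the device and volume group names, the linear
allocation and the exact options are guesses, not taken from the
report):
# One RAID5 set per SATA controller, 8 drives each (repeat for md1..md5).
mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[a-h]
# One large volume group over all six sets; LVM allocates linearly by
# default, so logical volumes are laid end to end across the sets.
pvcreate /dev/md[0-5]
vgcreate thumper /dev/md[0-5]
# 2TB logical volumes, each holding its own ext3 filesystem.
lvcreate --size 2T --name vol00 thumper
mkfs.ext3 /dev/thumper/vol00
With 3.5TB per RAID5 set and 2TB per logical volume, several volumes
inevitably cross a set boundary, so losing one set can destroy more
than one filesystem.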
- 080219 Tue
Switching my main work PC to a laptop
- For the past few weeks a laptop
has become my main working system. The main reason for this is
that the 160GB disk capacity is large enough that I can put on
it not just my home directory but enough of my archives (papers,
documentation and software) that I can work standalone. That
disk is also quite fast.
I am keeping my older, slower desktop, which has a 250GB and a 500GB
disk set, as a backup and as a repository for archives that I don't
need that often or on the move (for example games).
Another reason why I can use the laptop as my main system is
that its battery life is long enough, the screen is good enough,
and the size is small enough that I can make use of it during a
somewhat cramped commute, reclaiming a bit of time otherwise
wasted, and then it is powerful enough (it is actually a bit
faster than my desktop) to use at home over Ethernet with my desktop
as an X server.
Since I am somewhat conservative for a geek, this sort of means that
laptops have taken over even my computing habits. Not fully yet, as
I still need the desktops for demanding uses like large storage and
games, and for random peripherals that need more power than a laptop
battery or more bandwidth than a USB port can provide. Also, I am
still acutely aware that laptops are not as easy, quick and cheap to
self-repair as desktops, so extended outages are possible, and I
shall keep a backup desktop for quite a long time.
- 080217 Sun
Another RAID and volume management perversity
- The Linux RAID mailing list
sees a constant stream of amusing perversions, and today there
is another one:
start w/ 3 drives in RAID5, and add drives as I run
low on free space, eventually to a total of 14 drives (the max
the case can fit). But when I add the 5th or 6th drive, I'd
like to switch from RAID5 to RAID6 for the extra [ ... ]
Apart from the perversity, the questions above come without
any hint as to the intended use and expected access patterns,
without which it is difficult to offer topical comments, and
that's typical too of many queries to any mailing list.
I'm also interested in hearing people's opinions
about LVM / EVMS. I'm currently planning on just using RAID
w/out the higher level volume management, as from my reading I
don't think they're worth the performance penalty, [
However, I will go out on a limb and list here my general
(context-free) advice on RAID and storage setup:
- As a rule, assume that you don't know what you are doing. Anybody
can use mdadm, but that does not mean that they understand the
performance and reliability implications. If you cannot quote from
memory about a dozen research papers on storage and filesystems,
you usually cannot assess those implications. If you think that
this is elitism, go ahead and make your day.
- If you don't know what you are doing, use RAID10, and thanks to
Neil Brown Linux has a very nice software RAID10 implementation
(a minimal setup sketch follows this list). If you know what you
are doing, you have already chosen to use RAID10.
- Expect both occasional single drive failures, in the 3-10% per
year range, and, perhaps once every year or two, a multiple drive
failure per array (drive failures within an array are not
independent, because of very many common failure modes).
- Arrays with more than say 20 drives are not
a good idea even with RAID10.
- Usually don't bother subdividing your arrays, just use the block
device directly; rather create multiple arrays than subdivide a
large array.
- If you have to subdivide your array, first partition the disks,
and then build arrays on top of those partitions rather than vice
versa (see the sketch after this list).
- Using DM (and its frontends such as LVM or EVMS) is usually
pointless except for a few special cases.
- Worry a lot about backing up large arrays.
- Unless you can afford a tape robot, assume that backups can only
be done to a similar array, especially for arrays meant to contain
large filesystems.
- Don't build arrays larger than 5 to 10TB if you plan to
put a filesystem in them.
- If you really need parity based RAID, worry a lot about
alignment and stripe size.
- Prefer smaller chunk sizes to larger ones.
- Worry a lot about
- If you need to build a really large filesystem, use something
like several smaller filesystems over several smaller arrays
rather than a single large array.
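As a concrete illustration of the RAID10, partitioning and chunk
size points above, here is a minimal sketch (the device names, sizes
and parameter values are illustrative assumptions, not
recommendations for any particular hardware):
# Partition each disk first, then build the array out of the partitions,
# so that a replacement disk only needs a matching partition table.
for d in /dev/sd[a-d]; do
  parted -s "$d" mklabel gpt mkpart primary 1MiB 100%
done
# A 4-drive md RAID10 with a smallish chunk size.
mdadm --create /dev/md0 --level=10 --raid-devices=4 --chunk=64 /dev/sd[a-d]1
# If parity RAID really is needed, align the filesystem to the stripe:
# a 4-drive RAID5 with 64KiB chunks has 3 data disks per stripe
# (assuming such an array already exists as /dev/md1).
mkfs.xfs -d su=64k,sw=3 /dev/md1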
- 080216 Sat
A RAID and filesystem perversity
- Another wonderfully amusing
entry from the XFS mailing list:
I'm testing xfs for use in storing 100 million+ small files
(roughly 4 to 10KB each) and some directories will contain tens
of thousands of files. There will be a lot of random reading,
and also some random writing, and very little deletion. The
underlying disks use linux software RAID-1 managed by mdadm with
5X redundancy. E.g. 5 drives that completely mirror each other.
The entertainment value arises both because of the plan to use a
filesystem as a simple database manager and because of the RAID1
with 5 drives, with its implication that a RAID1 with 1 drive has 1X
redundancy. As to the use of a filesystem as a database, here are
some numbers from a particularly glorious example:
I have a little script, the job of which is to create a lot of
very small files (~1 million files, typically ~50-100bytes each).
[ ... ]
It's a bit of a one-off (or twice, maybe) script, and currently
due to finish in about 15 hours, hence why I don't want to spend
too much effort on rebuilding the box. Would rather take the
chance to maybe learn something useful about tuning...
and here is a comment on the results of using a simple database,
with some reasons why the filesystem alternative is not such a good
idea:
First, I have appended two little Perl scripts
(each rather small), one creates a Berkeley DB database of K
records of random length varying between I and J bytes, the
second does N accesses at random in that database.
I have a 1.6GHz Athlon XP with 512MB of memory, and a relatively
standard 80GB disc 7200RPM. The database is being created on a
70% full 8GB JFS filesystem which has been somewhat recently [ ... ]
$ time perl megamake.pl /var/tmp/db 1000000 50 100
$ ls -sd /var/tmp/db*
I am often flummoxed by the ability of people to dream up schemes
like the above, in part because I am envious: it must be very
liberating to be unconstrained by common sense, and to have the
courage to explore the vast space of syntactically valid
combinations, even those that seem implausible to people like me,
chained by the yearning for pragmatism.
- The size of the tree will be around 1M filesystem blocks on most
filesystems; the block size usually defaults to 4KiB, for a total
of around 4GiB, or can be set as low as 512B, for a total of
around 0.5GiB.
- With 1,000,000 files and a fanout of 50, we need 20,000
directories above them, 400 above those and 8 above those. So 3
directory opens/reads every time a file has to be accessed, in
addition to opening and reading the file itself (the arithmetic
is sketched after this list).
- Each file access will therefore involve four inode accesses and
four filesystem block accesses, probably rather widely scattered.
Depending on the size of the filesystem block, and on whether the
inode is contiguous to the body of the file, this can involve
anything between 2KiB and 32KiB of logical IO per file access.
- It is likely that the logical IOs relating to the top levels of
the subtree (those comprising 8 and 400 directories) will be
avoided by caching between 200KiB and 1.6MiB, but the other two
levels, the 20,000 bottom directories and the 1,000,000 leaf
files, are unlikely to be cached.
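The directory arithmetic above can be checked with a few lines of
shell (a sketch, using the fanout of 50 and the 1,000,000 files
assumed in the discussion):
# Sizes of the levels of a 50-way directory tree with 1,000,000 leaf files.
files=1000000; fanout=50
l1=$(( (files + fanout - 1) / fanout ))   # bottom directories: 20000
l2=$(( (l1 + fanout - 1) / fanout ))      # next level up: 400
l3=$(( (l2 + fanout - 1) / fanout ))      # top level: 8
echo "directories per level, top to bottom: $l3 $l2 $l1"
echo "opens per file access: 3 directory levels plus the file itself"
That matches the 8, 400 and 20,000 directories mentioned above, and
the four opens (and thus roughly four inode plus four block reads)
per file access.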
- 080210 Sun
Some more data on filesystem checking speed
- Usual points about
fsck speed being an issue:
As a followup, a couple of years back when I was
deploying U320 1TiB arrays at work, we filled each of them with
~800Gb of MP3s, forcibly powered down, and did fsck tests. ext3
was ~12 hours. reiserfs was ~6 hours. xfs was under 2 hours.
XFS got used.
Yesterday I had fun time repairing 1.5Tb ext3 partition,
containing many millions of files. Of course it should have
never happened - this was decent PowerEdge 2850 box with
RAID volume, ECC memory and reliable CentOS 4.4 distribution
but still it did. We had "journal failed" message in kernel
log and filesystem needed to be checked and repaired even
though it is journaling file system which should not need
checks in normal use, even in case of power failures.
Checking and repairing took many hours especially as
automatic check on boot failed and had to be run manually.
That's for 1TB, 1.5TB, and 2-3TB filesystems, but currently many
people would like to deploy 8TB filesystems (and some excessively
brave chancers would like much larger filesystems).
> I'll definitely be considering that, as I already had to wait hours for
> fsck to run on some 2 to 3TB ext3 filesystems after crashes. I know it
> can be disabled, but I do feel better forcing a complete check after a
> system crash, especially if the filesystem had been mounted for very
> long, like a year or so, and heavily used.
The decision process for using ext3 on large volumes is simple:
Can you accept downtimes measured in hours (or days) due to fsck?
No - don't use ext3.
There's no workaround for that. Do not ever ignore the need to run fsck
periodically. It's a safe thing to do. You can remount the xfs volume as
read-only and then run fsck on that - that's another thing to take into
account when setting things up.
Would that then be 20 hours? Admittedly a very large part of the
checking time is proportional to the number of files, and many 8TB
filesystems will contain much larger files than MP3s. But some
filesystems will have lots of small files. In general I suspect that
chunked file systems are a good idea.
Note: this entry was mostly written in 0708.
Slow transfer rate over SSH and improvements
- More or less by default the SSH2 protocol has become the standard
for simple inter-node operations, effectively replacing TELNET, RSH
and FTP, and it has become the favourite proxy for X11, SOCKS, etc.;
the main reason is not so much that it is secure, but that its
implementations (and their MS-Windows equivalents) are so
convenient: just the SSH agent makes it very easy to automagically
connect to various hosts without prompts, and the reliance on a
single TCP port allows easy firewall traversal for a number of
different proxied protocols. These are considerable advantages over
implementations of alternatives like TELNET over SSL.
The most significant problem with SSH is that when used as a
transport for bulk data it is quite slow. The most recent example
was a report that an otherwise very convenient SFTP-based file
transfer tool could only transfer around 1.7MB/s over a link
measured as able to transfer around 100MB/s over TCP, so I checked
and that was indeed the case. A web search confirms that this is
indeed a common complaint. The main reasons are:
- The SFTP protocol is half-duplex, and this makes it particularly
slow. It is much faster to use the SCP protocol with WinSCP (even
if it has some serious limitations, most importantly
non-restartability) or FISH as implemented by several clients.
However the best protocol over SSH, and arguably the best protocol
for bulk data transfer overall, is rsync.
- The default SSH encryption algorithms can take a lot of CPU time,
and CPU time can cause relatively long pauses between packets. The
solution is to use the least expensive ciphers (see the example
after this list). The quickest is arcfour, which is reasonable
even if it is weak over long sessions, and the second cheapest is
blowfish, which is recommendable for most uses. In particular many
AES implementations are rather slow (notably the one in WinSCP)
and 3DES is usually very slow.
- SSH is intrinsically a poor protocol for bulk data
transfer, and most implementations use small packets with
small buffers as that works well with SSH as a substitute for
TELNET. There is a
patch for the OpenSSH implementation
that improves performance for bulk data transfer over SSH.
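As an illustration of the cipher point above, here is how the
cheaper ciphers can be requested for a bulk copy (a sketch: the host
name and paths are made up, and arcfour is only appropriate where
its weakness over long sessions is acceptable):
$ scp -c arcfour bigfile.tar user@example.org:/data/
$ rsync -a -e 'ssh -c blowfish-cbc' archive/ user@example.org:/data/archive/
The same -c option can be given to ssh itself for tunnels and
proxied sessions.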
- 080202 Sat
Coupling and active redundancy
- Active redundancy as in clustering involves coupling --
because the exercise of the redundancy depends on
communications. Ideally this does not cause problems, but in
theory and in practice there are two big issues.
The theoretical big issue is that loss of communications cannot in
the general case be distinguished from loss of redundancy, and even
polling does not help unless there is a central synchronizing agent,
which is then itself a single
common mode of failure
(the even bigger problem is that it is not widely understood that
distributed computation without a central synchronizing agent
belongs to a fundamentally different class of models of computation
from those equivalent to Turing machines).
The practical problem is even worse: in practice it is
extremely difficult to allow for communication and recovery
among redundant elements without introducing common modes of
failure, simply because it is very difficult to communicate and
recover across entirely different designs and technologies,
because communication and recovery require a high degree of
compatibility.
Consider for example one of the most relied upon redundancy schemes,
RAID: nearly always the disks share the same location, power supply,
cooling, receive the same commands from the same OS over the same
bus, are subject to much the same access and wear patterns, and
ridiculously enough are often of the same model and even from the
same manufacturing batch.
More insidiously, when complicated communication protocols
are used, it is often easiest to ensure compatibility by using
the same software from the same manufacturer on all the members
of the redundancy set. The compatibility requirement thus
creates couplings in many cases. For example it is fairly common
for manufacturers to state that clustered systems only work
reliably if the same software product of the same
version is used, thus guaranteeing that the set
of bugs is exactly the same across all supposedly redundant
elements.
This is a general problem, and I remember reading that in effect in
modern experimental physics independent verification of results is
much harder than just repeating an experiment somewhere else,
because physicists tend to share and reuse the same software codes
and products.