Software and hardware annotations 2008 March

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


080315 Sat Outsourcing computer service
I was recently discussing with a couple of smart colleagues the merits of contracting out computer services, especially for a research institution that has both soft office-style requirements and hard operational ones. Part of my argument was that contract management is often not much less demanding than project management, as some of our experiences have shown. Another was that the concept of the computer utility is 40 years old, and that timesharing systems, starting with Multics, never quite became computer utilities. Well, they did, but not to the extent that electricity, water, and telephones did; and in many places even electricity, water, and telephone supplies are not that reliable or economical, so those who can afford it provide their own. Sure, when they are reliable one can indeed rely on them, and there are businesses that have eliminated the hassle and overhead of an internal telephone system and switched entirely to cellphones.
Sure, there are now Internet-based computer utilities, ranging from Amazon's S3 storage utility to Google's e-mail utility GMail to the contact and workflow utilities offered by several ASPs.
Then I realized that there are two reasons why, even in first world countries that still have reliable water, electricity, and phone utilities, there are issues with the computer utility concept. The first is that ordinary utilities supply a delivery service, where their output is sent to the customer, while computer utilities are based on the opposite concept: the customers send their data to the utility for processing.
Computer utilities are more like banks. When your electricity utility fails you lose your electricity supply, and it takes a while, but you can start your backup generator or start buying electricity from another supplier; when your bank fails you lose your money, it is gone, and the new account at your next supplier of banking services will be empty. Fortunately data can be copied more easily than money, so before depositing your business data in a computer utility you can back them up locally and continue to do so, unlike with your business capital. But keeping your data both in a computer utility and in a local backup greatly reduces the convenience of using the utility, just like having to keep an electricity generator running all the time in case the utility supply were to fail.
Sure, perhaps Amazon.com or Google or even SalesForce.com have demonstrated staying power, but even banks have a tendency to disappear suddenly with their customers' money, and there are scary statistics as to the fate of businesses that lose their data. The saving grace for the computer utility argument is that many businesses have local computer and data systems that are less reliable than an external supplier's, but "not as terrible" is hardly reassuring as a selling point.
The second reason why computer utilities are still not as popular as water, electricity, or phone utilities (or banks) is that computing services are not as simple and standardized; in other words, the market for computing service supply is not liquid. Computing power may be a commodity, but its delivery as a service is not as fungible and tradeable as one: it is not easy to compare different offers as to value, and it is not easy to switch suppliers to accept a better offer. It has taken decades for water, power, and phone services to become standardized, and this happened largely because they were nationalised; it has then taken decades for them to be denationalised and for markets to develop.
There are some compelling long-term economics favouring the delivery of computing power as a service: it has recently become more cost effective to put servers near a power plant and clients far away from the servers than vice versa. In the long term this is going to favour asset owners in first world countries, who own much better power and water plants than those in the rest of the world, and workers in third world countries, who have much lower living costs (and standards).
080303 Mon Linux block device performance issues
I have recently been doing some quick speed tests for a RAID setup under several conditions, and I was astonished that in some cases I was getting read rates several times slower than write rates, under GNU/Linux with software RAID.
The tests were of an 8-drive RAID10 using two sets of LSI MegaRAID host adapters on a 2×4 CPU system with lots of memory and a very fast PCIe bus, each disk individually capable of around 85-103MB/s. I saw write rates of around 250MB/s (reasonably good, especially for a -p f2 array) and read rates of 40-60MB/s, variable within a significant margin. Using
watch iostat 1 2
to check the spread of activity across the drives I could see either well balanced low rates of 5-7MB/s per drive, or unbalanced rates, almost arbitrarily. When I measured the single drives I got pretty good performance. At their simplest the array and drive tests were done with just
sysctl vm/drop_caches=3
dd bs=4k if=/dev/sdc of=/dev/null
but for more varied tests I also used Bonnie 1.4 (with -o_direct of course). Anyhow, just dd as above is sufficient to get a coarse but reasonable idea. It is also useful to watch vmstat 1 while doing tests.
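For context, an 8-drive RAID10 with the -p f2 (far 2) layout like the one under test might be created along these lines; this is only a sketch, and the device names and chunk size are assumptions, not the actual configuration used:
# Hypothetical creation of an 8-drive software RAID10 with the
# "far 2" layout; /dev/sd[c-j] and the 256KiB chunk are assumptions.
mdadm --create /dev/md0 --level=10 --raid-devices=8 \
  -p f2 --chunk=256 /dev/sd[c-j]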
At some point I remembered a report that to obtain good RAID performance the block device read-ahead had to be set to a very large value so I started testing various values, both on the RAID block device file itself (/dev/md0) and the underlying disk block devices, with a command like
blockdev --setra 65536 /dev/md0
and also with smaller values. The results were that with a read-ahead of less than 64 the underlying disk drives would perform at less than their top rate:
tree$ sudo sh -x /tmp/testra
+ for N in 1 4 16 32 64 128
+ blockdev --setra 1 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 6.94646 s, 29.5 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 4 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 6.93524 s, 29.5 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 16 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 4.11668 s, 49.7 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 32 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.88555 s, 52.7 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 64 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.53645 s, 57.9 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 128 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.58348 s, 57.2 MB/s
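For reference, the /tmp/testra script traced above was essentially a loop of the following form (a reconstruction from the trace, not necessarily the exact original):
#!/bin/sh
# Sweep the read-ahead of a drive and measure the resulting
# sequential read rate after dropping the page cache; run as root.
for N in 1 4 16 32 64 128
do
    blockdev --setra "$N" /dev/sda
    sysctl vm/drop_caches=1
    dd bs=4k count=50000 if=/dev/sda of=/dev/null 2>&1 | grep 'MB/s'
done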
The same happened with the RAID10 device itself, except that it underperformed with read-ahead values smaller than 64k. At some point I noticed something entirely unexpected: in the vmstat output the number of interrupts was inversely proportional to the value of the read-ahead. Since I was doing my tests on an otherwise idle machine, and there were very few interrupts when I was not running dd, this obviously correlated with the RAID operations.
Some more testing with dd, using various combinations of read-ahead (--setra) and read length (bs=) while watching vmstat, gives away the game: the Linux block IO subsystem obviously uses the read-ahead value as a blocking factor, turning it into the unit of IO requests issued to devices. Even worse, it looks like IO requests are issued at a multiple of 10 times a second, probably on clock ticks, not on queue depletion. All this is quite bad, because read-ahead ought to be adaptive, and anyhow the issue rate should depend on the overall application request rate and the device completion rate.
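The correlation between read-ahead and interrupt rate is easy to reproduce by watching the "in" column of vmstat while a dd read runs with different read-ahead settings; a minimal sketch of that kind of check, with the device name assumed and run as root:
# Watch the interrupt rate ("in" column of vmstat) during a large
# sequential read, first with a small and then a larger read-ahead.
for RA in 16 256
do
    blockdev --setra "$RA" /dev/sda
    sysctl vm/drop_caches=1
    dd bs=4k count=500000 if=/dev/sda of=/dev/null &
    vmstat 1 10        # note how "in" falls as the read-ahead rises
    wait
done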
Much worse, the block subsystem seems to run read requests (but not write ones) in a sort of half-duplex way, waiting for the completion of the previous read-ahead-sized request instead of streaming them. This is the obvious cause of performance dropping as the blocking factor gets smaller: the time wasted waiting for the previous request to complete and issuing a new one prevents the disk from running block reads back-to-back.
Now this last point is understandable in the case of plain ATA and SATA: that is how the host adapter works, as it can only process one request for N sectors at a time and report completion of the whole request. But it is very bad in the case of RAID host adapters, which can be expected to support mailboxing, usually in its tagged queueing variant.
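Whether a drive behind a given host adapter can actually queue multiple outstanding commands can be checked through sysfs on reasonably recent kernels; a small sketch, with the device names assumed:
# Report the command queue depth the kernel is using for each
# SCSI/SATA disk; a depth of 1 means requests are issued strictly
# one at a time.
for DEV in /sys/block/sd?
do
    echo "${DEV##*/}: $(cat "$DEV"/device/queue_depth)"
done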