Software and hardware annotations 2008 March

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


080315 Sat Outsourcing computer service
I was recently discussing with a couple of smart colleagues the merits of contracting out computer services, especially for a research institution that has both soft office-style requirements and hard operational ones. Part of my argument was that contract management is often not much less demanding than project management, as some of our experiences have shown. Another was that the concept of the computer utility is 40 years old, and that timesharing systems, starting with Multics, never quite became computer utilities. Well, they did, but not to the extent that electricity, water, and telephones did; and in many places even electricity, water, and telephone supplies are not that reliable or economical, so those who can afford it provide their own. Sure, when they are reliable one can indeed rely on them, and there are businesses that have eliminated the hassle and overhead of an internal telephone system and switched entirely to cellphones.
Sure, there are now Internet-based computer utilities, ranging from Amazon's S3 storage utility to Google's e-mail utility GMail to the contact and workflow utilities offered by several ASPs.
Then I realized that there are two reasons why, even in first world countries that still have reliable water, electricity, and phone utilities, there are issues with the computer utility concept. The first is that ordinary utilities supply a delivery service, where their output is sent to the customer, while computer utilities are based on the opposite concept: the customers send their data to the utility for processing.
Computer utilities are more like banks. When your electricity utility fails you lose your electricity supply, and it takes a while, but you can start your backup generator or start buying electricity from another supplier; when your bank fails you lose your money, it is gone, and the new account at your next supplier of banking services will be empty. Fortunately data can be copied more easily than money, so before depositing your business data in a computer utility you can back them up locally and continue to do so, unlike with your business capital. But keeping your data both in a computer utility and in a local backup greatly reduces the convenience of using the utility, just like having to keep an electricity generator running all the time in case the utility supply were to fail.
Sure, perhaps Amazon.com or Google or even SalesForce.com have demonstrated staying power, but even banks have a tendency to disappear suddenly with their customers' money, and there are scary statistics as to the fate of businesses that lose their data. The saving grace for the computer utility argument is that many businesses have local computer and data systems that are less reliable than an external supplier's, but "not as terrible" is hardly reassuring as a selling point.
The second reason why computer utilities are still not as popular as water, electricity, or phone utilities (or banks) is that computing services are not as simple and standardized; in other words, the market for computing service supply is not liquid. Computing power may be a commodity, but its delivery as a service is not as fungible and tradeable as one: it is not easy to compare different offers as to value, and it is not easy to switch suppliers to accept a better offer. It has taken decades for water, power, and phone services to become standardized, and this happened largely because they were nationalised; it has then taken decades for them to be denationalised and for markets to develop.
There are some compelling long-term economics favouring the delivery of computing power as a service: it has recently become more cost effective to put servers near a power plant and clients far away from the servers than vice versa. In the long term this is going to favour asset owners in first world countries, who own much better power and water plants than those in the rest of the world, and workers in third world countries, who have much lower living costs (and standards).
080303 Mon Linux block device performance issues
I have recently been doing some quick speed tests for a RAID setup under several conditions, and I was astonished that in some cases I was getting read rates several times slower than write rates, under GNU/Linux with software RAID.
The tests were of an 8-drive RAID10 using two sets of LSI MegaRAID host adapters on a 2×4 CPU system with lots of memory and a very fast PCIe bus, each disk individually capable of around 85-103MB/s. I saw write rates of around 250MB/s (reasonably good, especially for a -p f2 array) and read rates of 40-60MB/s, variable within a significant margin. Using
watch iostat 1 2
to check the spread of activity across the drives I could see either well balanced low rates of 5-7MB/s per drive, or unbalanced rates, almost arbitrarily. When I measured the single drives I got pretty good performance. At their simplest the array and drive tests were done with just
sysctl vm/drop_caches=3
dd bs=4k if=/dev/sdc of=/dev/null
but for more varied tests I also used Bonnie 1.4 (with -o_direct of course). Anyhow, just dd as above is sufficient to get a coarse but reasonable idea. It is also useful to watch vmstat 1 while doing tests.
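For context, an 8-drive RAID10 with the -p f2 (far 2) layout like the one under test might be created along these lines; this is only a sketch, and the device names and chunk size are assumptions, not the actual configuration used:
# Hypothetical creation of an 8-drive software RAID10 with the
# "far 2" layout; /dev/sd[c-j] and the 256KiB chunk are assumptions.
mdadm --create /dev/md0 --level=10 --raid-devices=8 \
  -p f2 --chunk=256 /dev/sd[c-j]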
At some point I remembered a report that to obtain good RAID performance the block device read-ahead had to be set to a very large value so I started testing various values, both on the RAID block device file itself (/dev/md0) and the underlying disk block devices, with a command like
blockdev --setra 65536 /dev/md0
and also with smaller values. The results were that with a read-ahead of less than 64 the underlying disk drives would perform at less than their top rate:
tree$ sudo sh -x /tmp/testra
+ for N in 1 4 16 32 64 128
+ blockdev --setra 1 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 6.94646 s, 29.5 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 4 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 6.93524 s, 29.5 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 16 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 4.11668 s, 49.7 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 32 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.88555 s, 52.7 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 64 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.53645 s, 57.9 MB/s
+ for N in 1 4 16 32 64 128
+ blockdev --setra 128 /dev/sda
+ sysctl vm/drop_caches=1
vm.drop_caches = 1
+ dd bs=4k count=50000 if=/dev/sda of=/dev/null
+ grep MB/s
204800000 bytes (205 MB) copied, 3.58348 s, 57.2 MB/s
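For reference, the /tmp/testra script traced above was essentially a loop of the following form (a reconstruction from the trace, not necessarily the exact original):
#!/bin/sh
# Sweep the read-ahead of a drive and measure the resulting
# sequential read rate after dropping the page cache; run as root.
for N in 1 4 16 32 64 128
do
    blockdev --setra "$N" /dev/sda
    sysctl vm/drop_caches=1
    dd bs=4k count=50000 if=/dev/sda of=/dev/null 2>&1 | grep 'MB/s'
done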
The same happened with the RAID10 device itself, except that it underperformed with read-ahead values smaller than 64k. At some point I noticed something entirely unexpected: in the vmstat output the number of interrupts was inversely proportional to the value of the read-ahead. Since I was doing my tests on an otherwise idle machine, and there were very few interrupts when I was not running dd, this obviously correlated with the RAID operations.
Some more testing with dd, using various combinations of read-ahead (--setra) and read length (bs=) while watching vmstat, gives away the game: the Linux block IO subsystem obviously uses the read-ahead value as a blocking factor, turning it into the unit of IO requests issued to devices. Even worse, it looks like IO requests are issued at a multiple of 10 times a second, probably on clock ticks, not on queue depletion. All this is quite bad, because read-ahead ought to be adaptive, and anyhow the issue rate should depend on the overall application request rate and the device completion rate.
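The correlation between read-ahead and interrupt rate is easy to reproduce by watching the "in" column of vmstat while a dd read runs with different read-ahead settings; a minimal sketch of that kind of check, with the device name assumed and run as root:
# Watch the interrupt rate ("in" column of vmstat) during a large
# sequential read, first with a small and then a larger read-ahead.
for RA in 16 256
do
    blockdev --setra "$RA" /dev/sda
    sysctl vm/drop_caches=1
    dd bs=4k count=500000 if=/dev/sda of=/dev/null &
    vmstat 1 10        # note how "in" falls as the read-ahead rises
    wait
done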
Much worse, the block subsystem seems to run read requests (but not write ones) in a sort of half-duplex way, waiting for the completion of the previous read-ahead-sized request instead of streaming them. This is the obvious cause of performance dropping as the blocking factor gets smaller: the time wasted waiting for the previous request to complete and issuing a new one prevents the disk from running block reads back-to-back.
Now this last point is understandable in the case of plain ATA and SATA: that is how the host adapter works, as it can only process one request for N sectors at a time and report completion of the whole request. But it is very bad in the case of RAID host adapters, which can be expected to support mailboxing, usually in its tagged queueing variant.
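Whether a drive behind a given host adapter can actually queue multiple outstanding commands can be checked through sysfs on reasonably recent kernels; a small sketch, with the device names assumed:
# Report the command queue depth the kernel is using for each
# SCSI/SATA disk; a depth of 1 means requests are issued strictly
# one at a time.
for DEV in /sys/block/sd?
do
    echo "${DEV##*/}: $(cat "$DEV"/device/queue_depth)"
done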