Computing notes 2012 part four

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

121228 Fri: Step-by-step is rather lacking

I have been refreshing some aspects of the configuration of my home desktop/server and laptop and since it is slightly tricky (Kerberos/NFS4) I have been frustrated.

One reason is the usual lack of relevant detail in error messages, but the other is that most online documentation is presented as a HOWTO, that is as a list of steps.

What is really useful is a description of the result to aim for, and only after that the details of one or more possible ways to achieve it.

That is because a step-by-step HOWTO can be error prone, and even if one wants to follow it one should check at the end that it has achieved the correct result. Indeed just knowing what the correct configuration looks like is invaluable when investigating issues.

But I have heard many sysadms argue that this is pointless, because they would never read that: as in I don't want to think, I don't want to understand, I just want to get it done, and when it works I don't care what the correct configuration is. I don't sympathize with that attitude at all, because it relies on blind luck, and it is based on the assumption, which most managers seem to share, that there is a silver bullet: that by canning the relevant steps they can be executed without skill or thinking.

In the present case I have written a list of what is needed to get NFSv4 to work with Kerberos, in part from my own experience and in part from consulting a number of HOWTOs (listed in the above link).
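
To illustrate what a description of the result looks like, this is roughly the end state for a Kerberos NFSv4 setup, where the realm, host and path names are purely illustrative and not those of my setup, with rpc.svcgssd running on the server and rpc.gssd on the client:

/etc/exports on the server:
  /home    *.example.org(rw,sec=krb5p,no_subtree_check)

/etc/idmapd.conf on both server and client:
  [General]
  Domain = example.org

keytabs (check with klist -k /etc/krb5.keytab):
  server: nfs/server.example.org@EXAMPLE.ORG
  client: nfs/client.example.org@EXAMPLE.ORG

/etc/fstab on the client:
  server.example.org:/home  /home  nfs4  sec=krb5p  0  0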

121223b Sun: Using flash SSD drives instead of 10K/15K RPM disk drives

Thanks to the continuing fall in flash SSD prices I think they are now cheap enough to be a viable alternative to 15K and 10K RPM drives in servers.

Currently I see SAS 146GB 15K drives for around $190 (ex. tax) and SATA 150GB 10K drives for $110 (ex. tax) and SATA 120GB or 128GB flash SSDs for around $110 (ex. tax) (1, 2).

The flash SSD drives are consumer-grade ones, therefore with less endurance, and also with less backup power, so more prone to lose cached data in case of power loss, but they seem to be pretty adequate, and their random IOPS are much better in most tests than those of any disk, with several tens of thousands of random IOPS.

The newer flash SSD drives seem far more reliable than those of just 2-3 years ago, both as to endurance and performance over time, but their main problem is that they don't have the same length of track record as 10K and 15K drives.
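
Endurance at least can be kept an eye on, as most flash SSDs expose wear-related SMART attributes; something like this (the device name is just an example, and the attribute names vary by vendor):

#  smartctl -A /dev/sda | egrep -i 'wear|life|written'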

But they have several advantages beyond much better random performance: flash SSDs are mechanically far more robust, being largely immune to the vibration which affects disk drives quite a lot, and obviously they don't generate any vibration either. They are also entirely silent, which can be an advantage too, as seeking disk drives can be very loud, even if usually most noise from servers comes from the tiny fans, especially those with the motor in the middle.

It seems plausible to use them nearly routinely on servers running replicated services, and to keep using disk drives in other roles.

121223 Sun: Slicing a single RAID set or multiple RAID sets

Having recently mentioned slicing resources in separate pools and keeping one (or more) in reserve for maintenance and upgrades, it occurred to me to mention a less optimal example of slicing, which seems very common.

The example is a file server with N drives of M TB each available for the file storage area. The drives have been set up as a single RAID6 set of those N drives, and since there is no requirement for a single large free storage pool, the resulting block device has been sliced into N-2 partitions of M TB each, every one with its own filetree.

This setup puts all the slices on the same RAID6 set, which has the following issues: every slice shares the same disk arms and therefore the same IOPS budget, all the slices have to be fscked when the set needs checking, and a failure of the RAID set means losing every filetree at once.

In the specific case not much of the above mattered, as the load on the fileserver was small, it took 6-7 hours to fsck all slices one after another, which was acceptable, and since all the hardware components have had low failure rates, in several years the RAID set has not suffered a loss of data.

But for an application profile with higher requirements it might have been more appropriate to divide the available drives into RAID1 pairs, RAID5 triples, or RAID10 (or even RAID6...) quartets, each containing a fully independent filetree, and to fill them in sequence.
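
For instance, a rough sketch with mdadm of the RAID1 pairs variant for six of the drives (device names purely illustrative), each pair getting its own independent filetree to be filled in sequence:

#  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
#  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd /dev/sde
#  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdf /dev/sdg
#  for MD in md1 md2 md3; do mkfs.xfs /dev/$MD; done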

121218b Tue: "Crop" rotation and fallow "fields" in computing

It is often necessary to provide computing services via infrastructures that are designed together, and over the years I have found that a very good and nearly always applicable overall strategy is very similar to the crop rotation and fallow fields whose usefulness in agriculture was discovered relatively recently.

The idea is to split the resources into separate pools, to leave some pools unused, and to rotate the load on the pool longest in use into one of the unused ones, and then to make the longest used pool into an unused one and rebuild it.

A good example was the storage system for a large data acquisition load. Because of external pressure a large centralized system had been procured, and I felt it would be a bad idea to commit the whole capacity of the system to a specific set of technologies (RAID type, filesystem type, network storage protocol, ...), especially as the type of load from the various data acquisition systems that would be using it was not yet known.

A very smart colleague suggested dividing the setup into some slices, for example 8, and configuring only the first; then when it filled up, configuring the next slice according to the experiences of the previous one, and so on, and then when all 8 slices had filled, to reuse the first one.

This was an excellent strategy to avoid overcommitting to an initial guess about future needs, and for a storage system for data acquisition it was particularly suitable, as the filled up slices could be made read-only, thus avoiding a lot of issues, including much reducing the need for fsck.

A similar technique is used to great advantage in the data centre of the Sanger Institute, which is divided into four computer rooms, of which three contain operating infrastructure, while the fourth is used to assemble a new infrastructure with the latest products; when that is ready, load is transferred to it from the oldest previous slice, which is then emptied to be ready for the next upgrade:

4x250 M2 Data centres.
  • 2-4KW / M2 cooling.
  • 1.8 MW power draw
  • 1.5 PUE
Overhead aircon, power and networking.
  • Allows counter-current cooling.
  • Focus on power & space efficient storage and compute.
Technology Refresh.
  • 1 data centre is an empty shell.
  • Rotate into the empty room every 4 years and refurb.
  • “Fallow Field” principle.

I have been following this strategy for many years at home, where I have 2-3 slices (each consisting of a laptop and server and accessories), and every 2 years or so I add a new slice with recent products and discard the oldest slice.

This is yet another reason why I reckon that having multiple smaller diverse pools of resources is better than having a large homogeneous one, as they can be rotated into maintenance and upgrade phases in a staggered fashion, thus minimizing the impact of maintenance and upgrades and allowing them to be more frequent.

121218 Tue: Long storage cluster restart times

There are consequences to having a centralized storage and VM system, as the following story shows.

The story is that there was a sudden power loss after a couple of years of uneventful operation.

This resulted in more than 24 hours of storage layer checking and resyncing time, plus several hours of fsck time, both at the overall filesystem level and during VM startup for their internal virtual drives.

Since the virtual disks of more than 100 virtual machines, comprising most of the services of the organization, were on that storage system, a relatively short period of power loss resulted in a couple of days of widespread unavailability of system infrastructure.

The story is not unexpected and it confirms my preference for fairly small, diverse, replicated infrastructures and dislike of virtual machines, as:

The overall strategy I prefer is that of an internet of relatively small, self contained, local computing infrastructures for small user communities (my ideal size in many cases is around 20 users each), rather than a centralized infrastructure shared by many users, because the internet model is flexible and scales. This depends on the resource cost of the service, as relatively cheap services (for example user directories) can be more centralized than relatively costly ones (for example search engines).

121215 Sat: Interleaved read/write transfer rates on SSD

While reading some transfer rate tests for recent flash SSDs I was amused that interleaved read/write tests are usually not done, and yet they are very important, because in flash memory there is a very large difference between reading and writing, and strategies that help with one don't help with the other.

A very simple test is to first write a file, then read it, then copy it and compare the rates. My Crucial M4 behaves fairly well: on a SATA2 interface that has a maximum transfer rate of 260-270MB/s (below the peak transfer rate of the SSD) it behaves like this:

#  dd bs=64k count=16000 conv=fsync if=/dev/zero of=TEST
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 4.97284 s, 211 MB/s
#  sysctl vm/drop_caches=1
vm.drop_caches = 1
#  dd bs=64k count=16000 conv=fsync of=/dev/null if=TEST
dd: fsync failed for `/dev/null': Invalid argument
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 4.0207 s, 261 MB/s
#  sysctl vm/drop_caches=1
vm.drop_caches = 1
#  dd bs=64k count=16000 conv=fsync of=TEST2 if=TEST
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 8.29767 s, 126 MB/s

This test was run on an otherwise quiescent system with the deadline elevator and in a moderately full XFS filetree, which however still resulted in nearly contiguous files (which should not matter much with SSDs):

#  filefrag TEST TEST2
TEST: 3 extents found
TEST2: 3 extents found

Note that during the copy reading and writing together sum to around 250MB/s (2×1.0GB in about 8.3s), which is close to the top transfer rate of SATA2; but note also that SATA2 is a full duplex bus, so in effect interleaved read/write halves the transfer rate, even though access times on a flash SSD are very small.

121128 Wed: Browser based JavaScript standalone applications

I listened to a presentation about a text processing application, one that allows creating a subset of an XML schema (in particular a large one, TEI) by specifying which elements or attributes to include or exclude from the original.

It was a GUI based application, and it was written entirely in JavaScript, running inside a single page which it redrew. There was no connection to a web server, or multiple pages or forms.

It could equally well have been written for the command line, or in Python or Java with a GUI toolkit.

So I asked rhetorically why it was written in JavaScript inside a web browser and not as a standalone application, and the answer was portability.

Sure, there are even games that run in JavaScript inside a browser, and there are JavaScript web applications that have offline modes of operation, but this is an application designed from the start to run standalone inside a browser.

Put another way, a browser defines a virtual machine with a virtual GUI toolkit, that is a complete platform, and this is bad news for Microsoft, who have made selling platforms (MS-DOS, MS-Windows, MS-Office) the core of their lock-in strategy.

What is somewhat notable is how much this has succeeded, as the application mentioned at the beginning was done by an intern over a few months, and is based on a relatively rich layer of available text and XML processing toolkits.

It is also notable that this was the intended role of Java rather than JavaScript, and that while Microsoft successfully neutered the Java platform for client applications, it may not have seen JavaScript as a platform that could rival .NET, which however remains very popular.

121124 Sat: Even slow flash SSDs are quite a bit faster than disks

As I read somewhere even rather slow flash SSDs can be preferable to disk drives in many cases.

Accordingly Crucial now have a new line of V4 flash SSD products, of which I have read some recent reviews.

The speeds in the reviews are significantly lower than those of higher end flash SSDs of the same capacity, for example sequential read and write speeds of 237MB/s and 206MB/s, roughly half of the over 400MB/s and 350MB/s of the best.

The difference in random access speed is even larger, at 10MB/s reading and 17MB/s writing, versus over 250MB/s and 320MB/s for the best (and since the multithreaded rate for the V4 is the same as the single threaded one, it is very likely that in both cases the firmware driving it is not optimally designed).
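
For what it is worth, random rates of this kind can be estimated with something like fio; a minimal sketch (TEST is just a scratch file on the filesystem under test, and the resulting figures depend heavily on block size and queue depth):

#  fio --name=randrw --filename=TEST --size=1g --rw=randrw \
      --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
      --runtime=60 --time_based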

However there are some important reasons why such a relatively slow drive may still be good value, given that it costs less.

Accordingly the V4 product is sold by its manufacturer as 9x faster than HDDs and the higher end M4 as 21x faster than HDDs, which in some ways underestimates and in others overestimates the differences, which summarized are:

Model                     Random read  Random write  Seq. read    Seq. write   Price inc. VAT
Crucial M4 SSD 256GB      55 MB/s      165 MB/s      272 MB/s     270 MB/s     £150
Crucial V4 SSD 256GB      10 MB/s      17 MB/s       237 MB/s     206 MB/s     £120
generic 7.2k disk 250GB   0.5 MB/s     0.5 MB/s      90-45 MB/s   90-45 MB/s   £45

While flash SSDs are rather faster in sequential transfers, even taking into account that disk transfers on the inner tracks run at half the rate of the outer tracks, the real benefit is the much lower latency of random IO, and that justifies the much higher price, in addition to mechanical resilience.

While the V4 random IO rates are surprisingly much lower than those of the M4, they are still so much higher than those of a disk drive that it can feel much more responsive, especially for metadata or small file intensive usage.

Probably a lower rate flash SSD like the V4 makes quite a bit of sense for a laptop, even if the price is 2-3 times that of a disk of equivalent capacity, as the additional price of around £80 for a 256GB one does not add that much to a laptop costing £400-500, percentagewise.

121117 Sat: Even lower prices for flash SSDs

Six months ago I was noticing a notable fall in price of 256GB flash SSDs to £180, so I checked again and found that pretty good 256GB models currently retail for around £115 (OCZ) or £140 (Samsung).

At those price levels and sizes I think that it is not a huge effort to put an SSD into most laptops, especially because of their far superior resistance to shocks.

As to performance, most current SSDs have potential transfer rates higher than the maximum possible with SATA2, yet quite a few laptops and most older desktops don't yet have SATA3 host adapters, so the fastest and most expensive flash SSDs are often not required. As someone argued previously, the low latency and greater (peak) transfer rates of even a relatively cheap flash SSD are so much better than those of a disk drive that the difference between slow and fast flash SSDs is often rather less noticeable.

121113 Tue: Inane error messages in MIT Kerberos

The Kerberos authentication system is not particularly complicated or difficult but some of its behaviours and features are quite subtle, and the error messages in the MIT Kerberos implementation tend to be awful, which makes finding out the cause of issues much more difficult than it should be.

Recently I was looking at a Kerberos source file and noticed a particularly irritating example in function kadm5int_acl_match_data of source file lib/kadm5/srv/server_acl.c:

    if (ws->nwild >= 9) {
        DPRINT(DEBUG_ACL, acl_debug_level,
               ("Too many wildcards in ACL entry.\n"));
    }

The error message does not say that 9 is the maximum, and carefully omits to say which ACL entry is involved, when there can be thousands. Fascinatingly, with a web search I found that the printing of the ACL entry involved was removed a few months ago.

Just one of many similar cases, and not just in MIT Kerberos.

121103 Sat: SSL certificate CAs not so trusted

The UK Access Management Federation for Education and Research acts as a clearinghouse for trust relationships between identity databases and their identity consumers for many large UK universities, and part of its membership requirements is the use of HTTP transactions encrypted and signed with certificates.

A fascinating aspect of this is that they require the certificates used between users and identity consuming servers to be signed by a recognized CA, but the far more critical trust fabric certificates, for communication between identity databases and their server clients, are recommended to be self-signed.

Part of the reason why is that such certificates need changing when they expire, which is laborious, and using self-signed certificates allows creating very long lived ones.
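
Creating such a long lived self-signed certificate is indeed a one-liner, for example with OpenSSL (the subject name and lifetime are just illustrative):

#  openssl req -x509 -newkey rsa:2048 -nodes -days 7300 \
      -subj '/CN=idp.example.ac.uk' -keyout idp-key.pem -out idp-cert.pem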

But the crucial point is that trust fabric certificates must be communicated securely via an authenticated channel between the federation and members, and thus there is no need for them to be signed by a third party, the CA. Also, while the federation accepts certificates signed by CAs, it does bilateral verification of those too.

But that is a very big deal, because CAs are supposed to be precisely the mechanism that removes the need to authenticate certificates bilaterally.

This to me seems a massive expression of distrust of the current CAs, which may not be enjoying a huge reputation after several painful incidents and massive criticism.

121029 Mon: Double 6to4 tunnel setup for Debian

As a previous entry demonstrated, two 6to4 tunnels are needed for efficient 6to4 networking, and that entry gave an example with bare commands.

Debian recommends the use of the ifup and ifdown wrappers configured by /etc/network/interfaces and this is a suitable equivalent:

iface		6to4net inet6 v4tunnel
  local		192.168.1.40
  endpoint	any
  netmask	16
  address	2002:c0a8:0128::

iface		6to4rly inet6 v4tunnel
  local		192.168.1.40
  endpoint	192.88.99.1
  netmask	3
  address	2002:c0a8:0128::
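
With those stanzas in /etc/network/interfaces the two tunnels can then be brought up and down with the usual wrappers:

#  ifup 6to4net 6to4rly
#  ifdown 6to4rly 6to4net
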
121021 Sun: Transfer rates profile of a recent disk drive

While I have been spending some time profiling flash SSDs I have taken disk drives a bit for granted, and treated the often poor performance of the Linux IO subsystem as a permanent fact.

I had mentioned for example my perplexity that disabling write caching on disk drives used to reduce write rates enormously, for example from 80MiB/s to 4MiB/s, but also reduced read rates significantly, which I found hard to explain.

I have rerun a few simple tests with and without caching on a relatively recent 2TB drive:

#  hdparm -W 1 /dev/sdd

/dev/sdd:
 setting drive write-caching to 1 (on)
#  dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 15.6287 seconds, 134 MB/s
#  dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.6821 seconds, 165 MB/s

Those 165MB/s for reading are fairly high (but inner tracks will be rather slower of course), and the 134MB/s for writing is somewhat inexplicably rather lower.

#  hdparm -W 0 /dev/sdd

/dev/sdd:
 setting drive write-caching to 0 (off)
#  dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 29.4652 seconds, 71.2 MB/s
#  dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.6973 seconds, 165 MB/s

Here without caching the write rate has halved and the read rate is the same, which is what should have always happened. The halving of the write rate is much better than it used to be, but still perplexing given the pretty large block size of 1MiB.

With a different block size things are rather different for writes:

#  hdparm -W 0 /dev/sdd

/dev/sdd:
 setting drive write-caching to 0 (off)
#  dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdd1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 146.079 seconds, 14.4 MB/s
#  hdparm -W 1 /dev/sdd

/dev/sdd:
 setting drive write-caching to 1 (on)
#  dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdd1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 12.7524 seconds, 164 MB/s

Here the much lower transfer rate without caching and with a smaller write size is expected, but the higher transfer rate with caching at the smaller write size (164MB/s versus 134MB/s with 1MiB writes) is less expected. Probably it is due to the disk drive's onboard controller being better able to pipeline small writes than larger ones.

Looking at another recent 1TB drive:

#  hdparm -W 1 /dev/sdb

/dev/sdb:
 setting drive write-caching to 1 (on)
#  dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.2312 seconds, 171 MB/s
#  dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 11.9435 seconds, 176 MB/s
#  hdparm -W 0 /dev/sdb

/dev/sdb:
 setting drive write-caching to 0 (off)
#  dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 28.91 seconds, 72.5 MB/s
#  dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 11.9543 seconds, 175 MB/s

Here both read and write bulk sequential rates peak around 175MB/s with caching, and the bulk write rate is somewhat less than half that without caching, but still fairly respectable. With a smaller operation size:

#  hdparm -W 0 /dev/sdb

/dev/sdb:
 setting drive write-caching to 0 (off)
#  dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdb1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 149.436 seconds, 14.0 MB/s
#  dd bs=128K count=16000 of=/dev/zero iflag=direct if=/dev/sdb1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 12.9375 seconds, 162 MB/s

We see again a much lower rate for writes and an only slightly reduced rate for reads.