This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
I have been refreshing some aspects of the configuration of my home desktop/server and laptop, and since it is slightly tricky (Kerberos/NFSv4) I have been getting frustrated.
One reason is the usual lack of relevant detail in error messages, but the other is that most online documentation is presented as a HOWTO, that is as a list of steps.
What is really useful is a description of the result to aim for, and only after that the details of one or more possible ways to achieve it.
That is because a step-by-step HOWTO can be error prone, and even if one wants to follow it one should check at the end that it has achieved the correct result. Indeed just knowing what the correct configuration looks like is invaluable when investigating issues.
But I have heard many sysadmins argue that such a description is pointless, because they would never read it: "I don't want to think, I don't want to understand, I just want to get it done, and when it works I don't care what the correct configuration is". I don't sympathize with that attitude at all, because it relies on blind luck, and it is based on the assumption, which most managers seem to share, that there is a silver bullet: that by canning the relevant steps they can be executed without skill or thinking.
In the present case I have written a list of what is needed to get NFSv4 to work with Kerberos, in part from my own experience and in part from consulting a number of HOWTOs (listed in the above link).
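For example, rather than just following steps, the end result can be checked directly; the sketch below assumes a hypothetical export of /home from a server called nfsserver, the usual MIT Kerberos client tools, and root privilege, and only illustrates the kind of checks I mean:

# on the server: the export should list Kerberos security flavours,
# e.g. "/home *(rw,sec=krb5p:krb5i:krb5)"
grep 'sec=' /etc/exports

# on the client: the keytab should contain an nfs/ service principal
klist -k /etc/krb5.keytab | grep 'nfs/'

# a Kerberos-secured mount should then work and show sec=krb5* in its options
mount -t nfs4 -o sec=krb5p nfsserver:/home /mnt
grep ' /mnt ' /proc/mounts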
Thanks to the continuing fall in flash SSD prices I think they are now cheap enough to be a viable alternative to 15K and 10K RPM drives in servers.
Currently I see SAS 146GB 15K drives for around $190 (ex. tax) and SATA 150GB 10K drives for $110 (ex. tax) and SATA 120GB or 128GB flash SSDs for around $110 (ex. tax) (1, 2).
The flash SSD drives are consumer-grade ones, therefore with less endurance, and also with less backup power, so more prone to losing cached data in case of power loss; but they seem to be pretty adequate, and their random IOPS are much better in most tests than those of any disk, at several tens of thousands of random IOPS.
The newer flash SSD drives seem far more reliable than those of just 2-3 years ago, both as to endurance and as to performance over time, but their main problem is that they don't have the same length of track record as 10K and 15K drives.
But they have several advantages: apart from much better random performance, flash SSDs are mechanically far more robust, being largely immune to the vibration which affects disk drives quite a lot, and obviously they don't generate any vibration either, and are entirely silent, which can be an advantage too, as seeking disk drives can be very loud, even if usually most noise from servers comes from the tiny fans, especially those with the motor in the middle.
It seems plausible to use them nearly routinely on servers running replicated services, and to keep using disk drives:
- enterprise drives, when the budget allows buying several of them for a RAID set and there is a need for absolutely no surprises 3-5 years later;
- nearline SATA and SAS drives, which tend to have those features for a now modest price premium over consumer ones.
Having recently mentioned slicing resources in separate pools and keeping one (or more) in reserve for maintenance and upgrades, it occurred to me to mention a less optimal example of slicing, which seems very common.
The example is a file server with N disks of M TB each available for the file storage area. The disks have been set up as a single RAID6 set of those N disks, and since there is no requirement for a single large free storage pool, the resulting block device has been sliced into N-2 partitions of M TB each, each with its own filetree.
This setup has slices of the same RAID6 set, which has the following issues:
- since writing to any one slice requires updating the parity and syndrome parts of the shared stripes, performance is highly correlated, in the sense that writing to one of the slices impacts all of them even if the others are only being read.
In the specific case not much of the above mattered, as the load on the fileserver was small, it took 6-7 hours to fsck all the slices one after another, which was acceptable, and since all the hardware components have had low failure rates, in several years the RAID set has not suffered a loss of data.
But for an application profile with higher requirements it might have been more appropriate to divide the available drives into RAID1 pairs, RAID5 triples, or RAID10 (or even RAID6...) quartets, each containing a fully independent filetree, and to fill them in sequence.
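Purely as an illustration of that alternative, a sketch with Linux MD and hypothetical device names, using RAID1 pairs each carrying an independent filetree:

# two independent RAID1 pairs instead of slices of one large RAID6
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdd /dev/sde
mkfs.xfs /dev/md10
mkfs.xfs /dev/md11
mkdir -p /srv/pool0 /srv/pool1
mount /dev/md10 /srv/pool0
mount /dev/md11 /srv/pool1
# fill /srv/pool0 first, and start using /srv/pool1 only once it is full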
It is often necessary to provide computing services via infrastructures that are designed together, and over the years I have found that a very good and nearly always applicable overall strategy is very similar to crop rotation and fallow fields, which were discovered relatively recently to be useful in agriculture.
The idea is to split the resources into separate pools, to leave some pools unused, and to rotate the load on the pool longest in use into one of the unused ones, and then to make the longest used pool into an unused one and rebuild it.
A good example was the storage system for a large data acquisition load. Because of external pressure a large centralized system had been procured, and I felt it would be a bad idea to commit the whole capacity of the system to a specific set of technologies (RAID type, filesystem type, network storage protocol, ...), especially as the type of load from the various data acquisition systems that would be using it was not yet known.
A very smart colleague suggested dividing the setup into some slices, for example 8, and configuring only the first; then when it filled up, configuring the next slice according to the experiences of the previous one, and so on, and then when all 8 slices had filled, to reuse the first one.
This was an excellent strategy to avoid overcommitting to an initial guess about future needs, and for a storage system for data acquisition it was particularly suitable, as the filled-up slices could be made read-only, thus avoiding a lot of issues, including much reducing the need for fsck.
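With each slice being an independent filetree, freezing a filled-up slice is then trivial; a sketch with a hypothetical mount point:

# remount a filled-up slice read-only
mount -o remount,ro /srv/slice1
# and to make it permanent, change its options to "ro" in /etc/fstab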
A similar technique is used with great advantage in the data centre of the Sanger Institute, which is divided into four computer rooms, of which three contain operating infrastructure, and the fourth is used to assemble a new infrastructure with the latest products; when that is ready, load is transferred to it from the oldest previous slice, and that is emptied to be ready for the next upgrade:
- 4x250 M2 Data centres.
- 2-4KW / M2 cooling.
- 1.8 MW power draw
- 1.5 PUE
- Overhead aircon, power and networking.
- Allows counter-current cooling.
- Focus on power & space efficient storage and compute.
- Technology Refresh.
- 1 data centre is an empty shell.
- Rotate into the empty room every 4 years and refurb.
- “Fallow Field” principle.
I have been following this strategy for many years at home, where I have 2-3 slices (each consisting of a laptop and server and accessories), and every 2 years or so I add a new slice with recent products and discard the oldest slice.
This is yet another reason why I reckon that having multiple smaller diverse pools of resources is better than having a large homogeneous one, as they can be rotated into maintenance and upgrade phases in a staggered fashion, and thus minimize the impact of maintenance and upgrades, allowing them to be more frequent.
There can be significant consequences to relying on a centralized storage and VM system, as the following story illustrates.
The story is that there was a sudden power loss after a couple of years of uneventful operation.
This resulted in more than 24 hours of storage layer checking and resyncing time, plus several hours of fsck time both at the overall filesystem level and during VM startup time for their internal drives.
Since the virtual disks of more than 100 virtual machines were on that storage system, comprising most of the services of the organization, a relatively short period of power loss resulted in a couple of days of widespread unavailability of system infrastructure.
The story is not unexpected, and it confirms my preference for fairly small, diverse, replicated infrastructures and my dislike of virtual machines.
The overall strategy is that of an internet of relatively small, self-contained, local computing infrastructures for small user communities (my ideal size is around 20 users in many cases), rather than a centralized infrastructure shared by many users, because the internet model is flexible and scales. This depends on the resource cost of the service: relatively cheap services (for example user directories) can be more centralized than relatively costly ones (for example search engines).
While reading some transfer rate tests for recent flash SSDs I was amused that interleaved read/write tests are usually not done, and yet they are very important, because in flash memory there is a very large difference between reading and writing, and strategies that help with one don't help with the other.
A very simple test is to first write a file, then read it, then copy it and compare the rates. My Crucial M4 behaves fairly well: on a SATA2 interface, which has a maximum transfer rate of 260-270MB/s (below the peak transfer rate of the SSD), it behaves like this:
# dd bs=64k count=16000 conv=fsync if=/dev/zero of=TEST
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 4.97284 s, 211 MB/s
# sysctl vm/drop_caches=1
vm.drop_caches = 1
# dd bs=64k count=16000 conv=fsync of=/dev/null if=TEST
dd: fsync failed for `/dev/null': Invalid argument
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 4.0207 s, 261 MB/s
# sysctl vm/drop_caches=1
vm.drop_caches = 1
# dd bs=64k count=16000 conv=fsync of=TEST2 if=TEST
16000+0 records in
16000+0 records out
1048576000 bytes (1.0 GB) copied, 8.29767 s, 126 MB/s
This test was run on an otherwise quiescent system with the deadline elevator and in a moderately full XFS filetree, which however still resulted in nearly contiguous files (which should not matter much with SSDs):
# filefrag TEST TEST2
TEST: 3 extents found
TEST2: 3 extents found
Note that the copy transfer rate of 126MB/s implies a combined read plus write rate of around 250MB/s, which is close to the top transfer rate of SATA2; but also note that even though SATA2 is nominally a full duplex bus, in effect interleaved read/write halves the transfer rate, even if access times on a flash SSD are very small.
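A rough way to probe the same effect is to run a sequential read and a sequential write concurrently and compare the combined rate with the single-stream numbers above; a sketch only, reusing the TEST file from above and writing a new scratch file TEST3:

sysctl vm/drop_caches=1
# one reader and one writer at the same time
dd bs=64k count=16000 if=TEST of=/dev/null &
dd bs=64k count=16000 conv=fsync if=/dev/zero of=TEST3 &
wait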
I listened to a presentation about a text processing application, one that allows creating a subset of an XML schema (in particular a large one, TEI) by specifying which elements or attributes to include or exclude from the original.
It was a GUI based application, and it was written entirely in JavaScript, running inside a single page which it redrew. There was no connection to a web server, or multiple pages or forms.
It could have been equally written for the command line, or in Python or Java with a GUI toolkit.
So I asked rhetorically why it was written in JavaScript inside a web browser and not as a standalone application, and the answer was portability.
Sure, there are even games that run in JavaScript inside a browser, and there are JavaScript web applications that have offline modes of operation, but this is an application designed from the start to run self-contained inside a browser.
Put another way, a browser defines a virtual machine with a virtual GUI toolkit, or a complete platform, and this is bad news for Microsoft, who have made selling platforms (MS-DOS, MS-Windows, MS-Office) the core of their lock-in strategy.
What is somewhat notable is how much this has succeeded, as the application I saw at the beginning was done by an intern over a few months, and is based on a relatively rich layer of available text and XML processing toolkits.
It is also notable that this was the originally intended role of Java rather than JavaScript, and that even if Microsoft successfully neutered the Java platform for client applications, it may not have seen JavaScript as a platform able to rival .NET, which however remains very popular.
As I read somewhere, even rather slow flash SSDs can be preferable to disk drives in many cases.
Accordingly Crucial now have a new line of V4 flash SSD products, of which I have read some recent reviews.
The speed tests in the reviews are significantly different from those of other flash SSDs of the same capacity, for example with sequential read and write speeds of 237MB/s and 206MB/s, roughly half the over 400MB/s and 350MB/s of the best.
The difference in random access speed is even larger at 10MB/s reading and 17MB/s writing, versus over 250MB/s and 320MB/s for the best (since the multithreaded rate for the V4 is the same as the single threaded one it is very likely that in both cases the firmware driving it is not optimally designed).
However there are some important reasons why such a relatively slow drive may still be good value, as it costs less.
Accordingly the V4 product is sold by its manufacturer as "9x faster than HDDs" and the higher end M4 as "21x faster than HDDs", which in some ways underestimates and in others overestimates the differences, which summarized are:
| Model | Random read (MB/s) | Random write (MB/s) | Seq. read (MB/s) | Seq. write (MB/s) | Price inc. VAT |
|---|---|---|---|---|---|
| Crucial M4 SSD 256GB | 55 | 165 | 272 | 270 | £150 |
| Crucial V4 SSD 256GB | 10 | 17 | 237 | 206 | £120 |
| generic 7.2k disk 250GB | 0.5 | 0.5 | 90-45 | 90-45 | £45 |
While flash SSDs are rather faster in sequential transfers, even taking into account that disk transfers on the inner tracks are twice as slow as on the outer tracks, the real benefit is the latency of random IO, and that justifies the much higher price, in addition to the mechanical resilience.
While surprisingly the V4 random IO rates are much lower than those of the M4, they are still so much higher than those of a disk drive that it can feel much more responsive, especially for metadata or small file intensive usage.
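As a back-of-the-envelope illustration (assuming, as the reviews do not say, that the random figures are for 4KiB transfers), the random MB/s numbers in the table above translate into approximate IOPS as follows:

# IOPS is roughly MB/s * 1024 / 4 for 4KiB transfers
for rate in 0.5 10 17 55 165; do
    echo "$rate MB/s is about $(echo "$rate * 1024 / 4" | bc) IOPS"
done
# roughly 128 IOPS for the disk, 2500-4400 for the V4, 14000-42000 for the M4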
Probably a lower rate flash SSD like the V4 makes quite a bit of sense for a laptop, even if the price is 2-3 times that of a disk of equivalent capacity, as the additional price of around £80 for a 256GB one does not add that much to a laptop costing £400-500, percentagewise.
Six months ago I noted a notable fall in the price of 256GB flash SSDs to £180, so I checked again and found that pretty good 256GB models currently retail for around £115 (OCZ) or £140 (Samsung).
At those price levels and sizes I think that it is not a huge effort to put an SSD into most laptops, especially because of their far superior resistance to shocks.
As to performance, most current SSDs have potential transfer rates higher than the maximum possible with SATA2, and many laptops and most older desktops don't yet have SATA3 host adapters, so the fastest and most expensive flash SSDs are often not required; and as someone argued previously, the low latency and greater (peak) transfer rates of even a relatively cheap flash SSD are so much better than those of a disk drive that the difference between slow and fast flash SSDs is often rather less noticeable.
The Kerberos authentication system is not particularly complicated or difficult but some of its behaviours and features are quite subtle, and the error messages in the MIT Kerberos implementation tend to be awful, which makes finding out the cause of issues much more difficult than it should be.
Recently I was looking at a Kerberos source file and noticed a particularly irritating example in function kadm5int_acl_match_data of source file lib/kadm5/srv/server_acl.c:
if (ws->nwild >= 9) {
    DPRINT(DEBUG_ACL, acl_debug_level,
           ("Too many wildcards in ACL entry.\n"));
}
The error message does not say that 9 is the maximum, and carefully omits to say which ACL entry is involved, of which there can be thousands. Fascinatingly, with a web search I found that the printing of the ACL entry involved was removed a few months ago.
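Lacking a better message, the offending entries can be looked for directly; a rough sketch that just counts '*' characters in the principal field (the kadm5.acl path below is a common default and varies between distributions):

# print ACL entries whose principal field has 9 or more wildcards
awk '!/^[[:space:]]*(#|$)/ && gsub(/\*/, "*", $1) >= 9 { print FNR ": " $0 }' \
    /var/kerberos/krb5kdc/kadm5.acl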
Just one of many similar cases, and not just in MIT Kerberos.
The UK Access Management Federation for Education and Research acts as a clearinghouse for trust relationships between identity databases and their identity consumers for many large UK universities, and part of its membership requirements is the use of HTTP transactions encrypted and signed with certificates.
A fascinating aspect of this is that they require the certificates used between users and identity consuming servers to be signed by a recognized CA, but the far more critical trust fabric certificates, used for communication between identity databases and their server clients, are recommended to be self-signed.
Part of the reason is that such certificates need changing when they expire, which is laborious, and using self-signed certificates allows creating very long lived ones.
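For example a very long lived self-signed certificate can be created in one step; a sketch with arbitrary names and a roughly 20-year lifetime (the federation's own rules about key sizes and naming would of course take precedence):

# generate a private key and a self-signed certificate valid for ~20 years
openssl req -x509 -newkey rsa:2048 -nodes -days 7300 \
    -keyout idp-key.pem -out idp-cert.pem \
    -subj "/CN=idp.example.ac.uk"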
But the crucial point is that trust fabric certificates must be communicated securely via an authenticated channel between the federation and members, and thus there is no need for them to be signed by a third party, the CA. Also, while the federation accepts certificates signed by CAs, it does bilateral verification of those too.
But that is a very big deal, because CAs are supposed to be the mechanism thanks to which there is no need to authenticate certificates bilaterally.
This to me seems a massive expression of distrust of the current CAs, which may not be enjoying a huge reputation after several painful incidents and massive criticism.
As a previous entry demonstrated, two 6to4 tunnels are needed for efficient 6to4 networking, and there was an example with bare commands.
Debian recommends the use of the ifup and ifdown wrappers configured by /etc/network/interfaces and this is a suitable equivalent:
iface 6to4net inet6 v4tunnel
    local 192.168.1.40
    endpoint any
    netmask 16
    address 2002:c0a8:0128::

iface 6to4rly inet6 v4tunnel
    local 192.168.1.40
    endpoint 192.88.99.1
    netmask 3
    address 2002:c0a8:0128::
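Assuming the stanzas above are in /etc/network/interfaces, the tunnels can then be brought up and checked with something like (a sketch, to be run as root):

ifup 6to4net
ifup 6to4rly
# the 2002::/16 address and routes should now be visible
ip -6 addr show dev 6to4net
ip -6 route show | grep '^2002:'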
While I have been spending some time profiling flash SSDs I have taken disk drives a bit for granted, and accepted the often poor performance of the Linux IO subsystem as a permanent fact.
I had mentioned for example my perplexity that disabling write caching on disk drives used to reduce write rates enormously, for example from 80MiB/s to 4MiB/s, but also, even more puzzlingly, reduced read rates significantly.
I have rerun a few simple tests with and without caching on a relatively recent 2TB drive:
# hdparm -W 1 /dev/sdd
/dev/sdd:
 setting drive write-caching to 1 (on)
# dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 15.6287 seconds, 134 MB/s
# dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.6821 seconds, 165 MB/s
Those 165MB/s for reading are fairly high (but inner tracks will be rather slower of course), and the 134MB/s for writing is, somewhat inexplicably, rather lower.
# hdparm -W 0 /dev/sdd
/dev/sdd:
 setting drive write-caching to 0 (off)
# dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 29.4652 seconds, 71.2 MB/s
# dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdd1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.6973 seconds, 165 MB/s
Here without caching the write rate has halved and the read rate is the same, which is what should have always happened. The halving of the write rate is much better than it used to be, but still perplexing given the pretty large block size of 1MiB.
With a different block size things are rather different for writes:
# hdparm -W 0 /dev/sdd
/dev/sdd:
 setting drive write-caching to 0 (off)
# dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdd1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 146.079 seconds, 14.4 MB/s
# hdparm -W 1 /dev/sdd
/dev/sdd:
 setting drive write-caching to 1 (on)
# dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdd1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 12.7524 seconds, 164 MB/s
Here the much lower transfer rate without caching and with a smaller write size is expected, but the higher transfer rate with caching and a smaller write size is less expected. Probably it is due to the disk drive's onboard controller being better able to pipeline small writes than larger ones.
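A possible follow-up, sketched here, is to probe intermediate sizes and see where the uncached write rate recovers; like the tests above it overwrites /dev/sdd1, so it is only for scratch devices:

hdparm -W 0 /dev/sdd
# compare uncached direct write rates at several block sizes
for bs in 128K 256K 512K 1M 4M; do
    echo "bs=$bs:"
    dd bs=$bs count=200 if=/dev/zero oflag=direct of=/dev/sdd1 2>&1 | tail -1
done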
Looking at another recent 1TB drive:
# hdparm -W 1 /dev/sdb
/dev/sdb:
 setting drive write-caching to 1 (on)
# dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 12.2312 seconds, 171 MB/s
# dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 11.9435 seconds, 176 MB/s
# hdparm -W 0 /dev/sdb
/dev/sdb:
 setting drive write-caching to 0 (off)
# dd bs=1M count=2000 if=/dev/zero oflag=direct of=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 28.91 seconds, 72.5 MB/s
# dd bs=1M count=2000 of=/dev/zero iflag=direct if=/dev/sdb1
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 11.9543 seconds, 175 MB/s
Here both read and write bulk sequential rates peak around
175MB/s
with caching, and the bulk write rate is
somewhat less than half that without caching, but still fairly
respectable. With a smaller operation size:
# hdparm -W 0 /dev/sdb
/dev/sdb:
 setting drive write-caching to 0 (off)
# dd bs=128K count=16000 if=/dev/zero oflag=direct of=/dev/sdb1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 149.436 seconds, 14.0 MB/s
# dd bs=128K count=16000 of=/dev/zero iflag=direct if=/dev/sdb1
16000+0 records in
16000+0 records out
2097152000 bytes (2.1 GB) copied, 12.9375 seconds, 162 MB/s
We see again a much smaller rate for writes without caching, and an only slightly reduced one for reads.