Computing notes 2012 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


120120 Fri: OCFS2 with DRBD or iSCSI

One obvious point that I should have mentioned in the previous entry about OCFS2 is that while it is pretty good as a standalone filesystem, it is nice to have the option to share a filetree among systems in the future, in particular for real high-availability applications involving redundant storage media, not merely shared storage, for example with DRBD:

The Oracle Cluster File System, version 2 (OCFS2) is a concurrent access shared storage file system developed by Oracle Corporation. Unlike its predecessor OCFS, which was specifically designed and only suitable for Oracle database payloads, OCFS2 is a general-purpose filesystem that implements most POSIX semantics. The most common use case for OCFS2 is arguably Oracle Real Application Cluster (RAC), but OCFS2 may also be used for load-balanced NFS clusters, for example.

Although originally designed for use with conventional shared storage devices, OCFS2 is equally well suited to be deployed on dual-Primary DRBD. Applications reading from the filesystem may benefit from reduced read latency due to the fact that DRBD reads from and writes to local storage, as opposed to the SAN devices OCFS2 otherwise normally runs on. In addition, DRBD adds redundancy to OCFS2 by adding an additional copy to every filesystem image, as opposed to just a single filesystem image that is merely shared.

Another application is with shared non-redundant storage, that is some kind of SAN layer, which can be purchased ready-made or built using a Linux host running some iSCSI dæmon (1, 2).
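
As an aside, here is a minimal sketch of the second option, exporting a block device from a Linux host with the tgt iSCSI target daemon (tgtd); the target name and backing device are made-up examples, and a real setup would also need persistent configuration and proper access control:

# Sketch: export a block device over iSCSI with tgtd (already running).
# The IQN and the backing device below are hypothetical examples.
tgtadm --lld iscsi --op new --mode target --tid 1 \
  --targetname iqn.2012-01.example.com:export1
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
  --backing-store /dev/vg0/export1
# allow any initiator to connect (restrict this on a real network)
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL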

120119 Thu: Test pattern for LCDs helps VGA autosync

It is noticeable how much picture quality improves when a monitor that is fed a video signal via a traditional VGA analog link is well synchronized with it, the more so I guess with LCDs, as their rigid pixel geometry makes a poorly synchronized image map onto it particularly badly.

Contemporary monitors tend to have both manual and automatic synchronization (positions, phase and clock) and manual synchronization usually takes time, so automatic is a much easier option. However it sometimes does not get an optimal result.

In a very useful monitor testing suite by Lagom there is a page with a test specifically for good synchronization which uses a high contrast and frequency pattern to give a visual indication of how well synchronized the video signal is to the LCD geometry. The pattern suffers from obvious spatial aliasing if synchronization is poor.

I suspect that autosynchronization works by maximizing something like the contrast in the signal. I noticed that autosync seemed better when I had terminal windows open covering the screen, and then wondered whether autosyncing with that pattern would help it converge. Indeed it does, and using the test pattern as a tile over a large part of the screen (such as the background) seems to help autosync considerably (both in the time needed and in the resulting quality) on several monitors that I have used.
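
As a rough sketch of what I do, assuming the Lagom pattern has been saved locally (the filename is hypothetical), one of the several tools that can tile an image over the X root window is feh:

# Tile the saved Lagom clock/phase test pattern over the whole background,
# then press the monitor's "auto adjust" button; the path is an example.
feh --bg-tile ~/images/lagom-clock-phase.png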

120118 Wed: OCFS2 a nice filesystem with good performance

I have tried a number of filesystems over time and my favourites have always been XFS and JFS, because of their blend of good features and scalable performance, but there are a few others that are interesting.

My favourites are not the currently popular ones, ext3 and ext4, because too many compromises have been made in their design for the sake of backwards compatibility, nor BTRFS, because it is still a bit immature and its copy-on-write design is a relative novelty.

Unfortunately interesting filesystems like ReiserFS and Reiser4 are less well maintained than others.

A filesystem with rather interesting qualities, and which is still well maintained with the support of a large sponsor like Oracle, is OCFS2, which has a sound traditional design and supports all the recent kernel features that have been added to ext4 and XFS (but not, for example, to JFS), like the use of barriers and TRIM or FITRIM, checksums (metadata only), a form of snapshots, and space reservation and release.

Its main design goal was to be a cluster shared filesystem, but it also works pretty well in standalone mode. I have also found among my notes some very simple performance tests among OCFS2, ext4, XFS and JFS on an old spare test machine which is somewhat slow by contemporary standards. The top hard disk speed is:

#  hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads:  160 MB in  3.01 seconds =  53.07 MB/sec

The test is a dump via tar of a GNU/Linux root filetree, that is one with a large number of small files and some bigger ones; the filetree is in a partition not at the beginning of the disk, and the results, in increasing order of speed, are:

ext4 with 4KiB inodes
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35900+0 records in
35900+0 records out
9410969600 bytes (9.4 GB) copied, 677.7 seconds, 13.9 MB/s

real    11m18.762s
user    0m3.139s
sys     1m1.893s
ext4 with 1KiB inodes
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -cS -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35582+0 records in
35582+0 records out
9327607808 bytes (9.3 GB) copied, 620.853 seconds, 15.0 MB/s

real    10m20.895s
user    0m3.387s
sys     1m1.026s
ext4 with 256B inodes
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 579.691 seconds, 16.2 MB/s

real    9m40.770s
user    0m3.060s
sys     0m59.262s
XFS
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35907+0 records in
35907+0 records out
9412804608 bytes (9.4 GB) copied, 413.755 seconds, 22.7 MB/s

real    6m54.763s
user    0m3.014s
sys     1m3.816s
OCFS2
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 368.144 seconds, 25.6 MB/s

real    6m9.149s
user    0m3.204s
sys     1m27.334s
JFS
#  sysctl vm/drop_caches=3
vm.drop_caches = 3
#  time tar -c -b 512 -f - -C /mnt/sda5 . | dd bs=512b of=/dev/null
35901+0 records in
35901+0 records out
9411231744 bytes (9.4 GB) copied, 301.641 seconds, 31.2 MB/s

real    5m2.679s
user    0m3.051s
sys     0m59.131s

A simple performance test like this is not that rich in information, in particular because the filetree is freshly loaded and that is not a representative situation, but I think that it is somewhat indicative.
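
For anyone who wants to try OCFS2 in the standalone mode used above, a minimal sketch (the device and mount point are examples; the -M local mount type, if the installed ocfs2-tools support it, avoids having to configure the O2CB cluster stack):

# Format and mount OCFS2 as a standalone (non-cluster) filesystem.
mkfs.ocfs2 -L scratch -M local /dev/sda5
mount -t ocfs2 /dev/sda5 /mnt/sda5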

I also think that OCFS2 is going to be supported by Oracle for a long time, as it is very popular among users of Oracle Database, in particular in its RAC mode, and that involves a lot of money for Oracle.

120117 Tue: LCD panels manufactured at a loss

It is a bit worrying that LCD panels have been manufactured at a loss for years, because it means that, rather unsurprisingly, the LCD panel industry is similar to the RAM industry: very costly factories with long build cycles create what economists call a hog cycle, with periods of high profits while new factories are being built and supply is scarce, and of large losses when the new factories start operating and supply is ample.

The worry is that when supply is overabundant the cycle can become extreme as suppliers completely exit the market, and eventually supply becomes too scarce as new factories stop being built. Also, excessive investment in factories for one technology can discourage investment in factories for a better technology, and several are indeed possible, like OLED panels.

But for the time being extreme demand from high end mobile telephones and tablet buyers should help LCD manufacturers decide to stay in the LCD business, and in the short term one can have really impressive LCD monitor bargains.

120116 Mon: Chassis with 8 independent server slots within it, and alternatives

While reading the current issue of PC Pro magazine I found the review of the Boston Quattro 1332-T chassis quite interesting. The idea is to put in a 3U chassis 2 power supplies and 8 single-socket server nodes, each with a Xeon E3 series chip, which are at the lower speed but also lower power end of the full Xeon spectrum, but still quite fast. Indeed the power consumption figures are particularly interesting:

As a baseline, we measured the chassis with all nodes turned off consuming only 15W. With one node powered up we measured usage with Windows Server 2008 R2 in idle settling at 102W. Running the SiSoft Sandra benchmarking app saw consumption peak at 138W.

We then powered up more nodes and took idle and peak measurements along the way. In idle we saw two, four and eight nodes draw a total of 128W, 154W and 222W; under load these figures peaked at 182W, 289W and 512W. These are respectable figures, equating to an average power draw per node of only 28W in idle and 64W under heavy load.

Especially considering the cost of supplying power and cooling those are remarkable and attractive numbers.

Boston, like most other smaller server suppliers (such as the familiar Transtec and Viglen), often design their products around SuperMicro or TYAN building blocks, and these have included in the recent past some similar solutions, such as 1U chassis with 2 dual-socket server nodes, which works out at 6 servers per 3U instead of 8 servers (but 12 sockets instead of 8), or 4U chassis with 18 single-socket server nodes. In all these the configuration of each server node is not quite the same, and the Boston 1332-T mentioned earlier seems to have a particularly rich configuration, even if it accepts only Xeon E3 series chips, while the 1U chassis with 2 server nodes can usually take more powerful Xeon chips.

The attraction of all these alternatives is that they are designed for a devirtualization strategy, in which sites that have found out how huge the overheads, bugs and administrative complications of virtual machines are for storage and network intensive workloads can undo that mistake, while still partitioning the workload into different domains, for example for higher resilience.

The alternative devirtualization strategy is to use servers with multiple processor sockets and chips with many CPUs, like the recent 12×CPU and 16×CPU chips from AMD, which are awesome for highly multithreaded applications that are uneconomical to partition, like web servers accessing a common storage backend, or for embarrassingly parallel HPC applications like Monte Carlo simulations, which are popular in finance and high-energy physics.

120115 Sun: Flash SSD maximum latency and interleaved reads and writes

Continuing the discussion of the rather complicated subject of the performance envelope of flash based SSDs, there are two aspects of it that are rarely looked at in reviews, and they are related: concurrent reading and writing, and maximum operation latency.

They are related because several FTL firmwares perform very badly when doing concurrent reads and writes (which really means closely interleaved ones, as the interface into a flash SSD is as a rule a single channel, such as USB or SATA), and this manifests itself as very high operation latencies.

As to latencies there is a fascinating graph from a recent review of several contemporary flash SSD products which shows maximum latencies of over 60 milliseconds for several of them on random small writes (and negligible ones on random small reads).

There are also other tests that show that certain flash SSDs perform well under random and sequential access patterns, as long as reads and writes are distinct and not interleaved, but then performance falls considerably if they are interleaved, as they would be in most realistic usage patterns.

My guess is that the two phenomena are related, and in particular because:

The resulting performance levels are still usually very much better (2-3 orders of magnitude better), just as for random pure reads and pure writes, than for rotating storage devices, as for the latter positioning is so expensive that it overshadows the cost of dealing with erasing flash blocks.

Yet in both cases above there is a reflection of the cost of having the FTL perform housekeeping in the background, and of having to rely on the ability of the FTL authors to provide a less unbalanced performance envelope; some authors seem to aim for that, and others for higher peaks. Part of the reason why I chose the Crucial M4 is that various tests seem to show that the authors of its FTL aimed for a less unbalanced performance envelope, which means fewer surprises. Anyhow, since my laptop does not have a SATA3 interface, the possible performance peaks are not accessible.
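
For those who want to check the interleaved case on their own unit, a minimal sketch with fio (the scratch file name, size and 70/30 read/write mix are arbitrary choices, not what the reviews above used):

# Mixed random reads and writes on a scratch file, with completion
# latencies (including maxima) reported per direction by fio.
fio --name=mixed --filename=/mnt/ssd/fio.tmp --size=4g --direct=1 \
    --ioengine=libaio --rw=randrw --rwmixread=70 --bs=4k --iodepth=1 \
    --runtime=60 --time_based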

120114 Sat: Flash SSD data lifetime

Being a major business supplier, Dell tend to have somewhat better documentation than most, and I was happy to find on their site a fairly reliable and interesting introduction to flash SSD drives and the differences between high end and low end ones, Dell™ Solid State Disk (SSD) Drives – High Performance and Long Product Life, and an even more interesting Solid State Drive (SSD) FAQ which has some points about little known aspects of flash SSDs. The first was news to me, even if in retrospect it is quite understandable (a lot of what suppliers do is about minimizing warranty returns):

5. Why I might notice a decrease in write performance when I compare a used drive to a new drive?

SSD drives are intended for use in environments that perform a majority of reads vs. writes. In order for drives to live up to a specific warranty period, MLC drives will often have an endurance management mechanism built into the drives. If the drive projects that the useful life is going to fall short of its warranty, the drive will use a throttling mechanism to slow down the speed of the writes.

But there are or were technical reasons why writes could become slower over the life of the drive: especially in the absence of TRIM style commands, all flash blocks could become partially used, requiring read-modify-erase-write cycles on every update. The second point surprised even me, as I was aware that flash memory weakens with time, but not that quickly:

6. I have unplugged my SSD drive and put it into storage. How long can I expect the drive to retain my data without needing to plug the drive back in?

It depends on the how much the flash has been used (P/E cycle used), type of flash, and storage temperature. In MLC and SLC, this can be as low as 3 months and best case can be more than 10 years. The retention is highly dependent on temperature and workload.

NAND Technology    Data Retention @ rated P/E cycle
SLC                6 months
eMLC               3 months
MLC                3 months

Data Retention:

Data retention is the timespan over which a ROM remains accurately readable. It is how long the cell would maintain its programmed state when the chip is not under power bias. Data retention is very sensitive to number of P/E cycle put on the flash cell and also dependent on external environment. High temperature tends to reduce retention duration. Number of read cycles performed can also degrade this retention.

This of course applies to flash-card or stick style pocket SSDs too, but I suspect that they have some capacitor providing a trickle current to keep the data refreshed. Or else it explains why many people have difficulty rereading photographs etc. from pocket SSDs left on a shelf for a long time. Still, most issues with them are about the limited number of erase cycles, as most pocket flash SSDs eventually fail because of erase cycle damage (as they tend not to have overprovisioning unless in USB form factor and from a good supplier). But it would still be a good idea in general to keep flash memory devices plugged in, so that they have power to allow (if possible) refreshing of decaying memory charges.

Anyhow, magnetic recording storage like conventional hard drives is not going to lose its primacy as backup storage, and not just because I can still read without problems ATA disks I had many years ago, but also because such drives are much cheaper, perform equally well in writing as in reading, and perform pretty well in absolute terms in streaming mode (with limited seeking).

Another interesting note concerns queue depth which was seen to have an important effect as to efficiency (along of course with blocks sizes) in some previously mentioned SSD tests:

  • Varying Queue Depths: Queue depth is an important factor for systems and storage devices. Efficiencies can be gained from increasing queue depth to the SSD devices which allow for more efficient handling of write operations and may also help reduce write amplification that can affect the endurance life of the SSD.

The reasoning seems to be that a higher queue depth helps provide larger amounts of data to the somewhat large onboard buffer cache the FTL is given (256MiB for my Crucial M4), and thus better opportunities to arrange the data to be written in large aggregates, reducing the number of erase cycles.

But precisely because the onboard buffer cache is fairly large, I am somewhat skeptical that writing is really going to be improved by higher queue depths rather than simply by larger write transactions or just closely spaced ones. It may instead be improving random small reads, and indeed in the previously mentioned test the read rate for 4KiB blocks with 32 threads reading is three times higher than with a single thread. Which however is a bit surprising, as there are no erase issues with reads, which leads me to think that there is some kind of latency (perhaps in the MS-Windows IO code) that impacts single threaded random reads.
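
One way to check that guess is to sweep the queue depth for small random reads and see where, if anywhere, deeper queues actually help; a minimal sketch (scratch file name and sizes are arbitrary):

# Run the same 4KiB random read test at increasing queue depths and
# compare the bandwidth and latency figures that fio prints.
for QD in 1 2 4 8 16 32; do
  fio --name="randread-qd$QD" --filename=/mnt/ssd/fio.tmp --size=4g \
      --direct=1 --ioengine=libaio --rw=randread --bs=4k \
      --iodepth="$QD" --runtime=30 --time_based
done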

120113 Fri: How much written to disk per day on my laptop

As to flash SSD endurance, I have been running for a day a command to report every four hours the amount of data written to my laptop flash SSD:

#  date; iostat -dk sda $((3600*4))
Thu Jan 12 10:32:41 GMT 2012
Linux 3.0.0-14-generic (tree.ty.sabi.co.UK)     12/01/12        _x86_64_        (2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               2.75        20.64        28.48    3005808    4147957

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.67         4.67        33.35      67180     480226

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              21.78       115.60        60.75    1664696     874750

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.53        28.51        63.55     410512     915110

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               5.74        91.91        65.65    1323452     945326

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.33         0.02         4.66        252      67152

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               2.30        13.54        22.49     194956     323885

^C
 tree#  uptime; iostat -dk sda
 13:05:17 up 2 days, 18:59,  9 users,  load average: 0.03, 0.04, 0.05
Linux 3.0.0-14-generic (tree.ty.sabi.co.UK)     13/01/12        _x86_64_        (2 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               3.85        27.89        33.48    6727568    8075561

The first report is the total since boot up to that point, the next six are each for the previous four hours. The average is well under 4GiB per day, and curiously the amount written is considerably higher than the amount read (except for a surge when I RSYNC'ed with my backup server). I have investigated in the past disk activity on my laptop with a view to minimizing it, to save power by letting the disk stay in standby as long as possible, and the write rates are due to some poorly written programs that keep updating or saving some files needlessly.
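
The same estimate can be had without waiting for periodic iostat reports, by reading the kernel's per-device counters directly; a small sketch (sda is an example device):

# Field 7 of /sys/block/sda/stat is sectors written (512-byte units);
# divide by the uptime in days to get an average daily write volume.
awk -v up="$(awk '{print $1}' /proc/uptime)" \
    '{printf "%.2f GiB written per day\n", $7 * 512 / 2^30 / (up / 86400)}' \
    /sys/block/sda/stat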

120112 Thu: Chinese parallel system uses Chinese CPUs

The Chinese government may be paranoid (USA designed chips may have backdoors), or may simply understand economics (chips are high value added), and they have been funding the development of Chinese designed CPU chips for a long time, focusing in particular on those based on MIPS architectures.

A relatively recent article notes that a large Chinese cluster uses Chinese CPU chips which is significant news. The CPU chips are indeed implementations of the MIPS architecture and their microarchitecture is rumoured to be heavily inspired by that of the DEC Alpha CPU chips.

Presumably the Chinese government do not yet have the very expensive chip factories needed to build contemporary technology chips (but I hope that they will invest in or buy AMD and its fabrication spinoff), so the chips run only at 1.1GHz, and the design therefore aims more at being low power and at having lots of CPUs (each chip has 16 CPUs) to compensate for that. Which I often think is the right approach for the embarrassingly parallel sort of problems for which clusters are suitable.

As to being low power, some sources note that it is quite remarkable that a 0.79PFLOPS cluster draws only 1MW, as in this list of power efficient clusters almost all the more power efficient clusters are much smaller than it, and clusters of similar overall computing power draw a lot more electrical power; this is very interesting because cluster cooling is one of the biggest problems in HPC:

First, with a max draw of around one megawatt, it is incredibly power-efficient. Its contemporaries at the top of the supercomputing charts use at least two megawatts and the US’s fastest supercomputer, Jaguar, draws no less than seven megawatts.

120111 Wed: Ambiguous file system terminology and change

Since I work a lot on storage issues I also think a lot about them, as this blog probably shows, and this requires good terminology to avoid long composite names and unproductive ambiguity.

One of the complications that I have been struggling with is that existing terminology engenders a confusion between a type of file system (for example JFS or XFS) and a specific instance with specific content. There has been a common and rather dangerous convention that I have followed so far where file system (note the space) means the type and filesystem means an instance.

So for example one could say that the block device /dev/sda6 contains an XFS filesystem, and that XFS is a file system particularly suitable for large filesystems.

I have decided that this convention is rather too dependent on context and error prone, and that always saying file system type is too awkward.

Therefore I will be using file-system to denote the type, the structure, and file-tree to denote an instance. Conceivably some file-system allows non-tree like instances, but essentially all contemporary ones only allow trees (with shared leaves, or at most branches) so file-tree is not a misnomer.

120110 Tue: Flash drive endurance, chip structure, and short stroking

While revising the specification sheet of my recently acquired flash SSD drive I noticed that under Endurance it says 72TB or 40GB per day for 5 years. So I looked up some photos of the components making up the device, and there are 16×29F128G08 CFAAB units, whose physical endurance properties I then checked.

These are chip carriers containing two MLC, 25nm, synchronous, 128Gib/8GiB chips with a 4KiB flash-page size, 1MiB flash-blocks and an endurance of 3000 erase cycles. If each chip were to be erased 3000 times and each time fully written, that would be 768TB, that is around 10 times more than in the specification sheet.
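
A quick arithmetic check of that raw figure (chip counts as read off the photos mentioned above):

# 16 carriers x 2 dies of 8GiB each, each erased and fully written 3000 times.
echo $(( 16 * 2 * 8 ))          # 256    -> GiB of flash in the device
echo $(( 16 * 2 * 8 * 3000 ))   # 768000 -> GiB written at rated endurance,
                                #           i.e. the ~768TB figure above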

The difference is so large (a bit more than an order of magnitude) that it is unlikely that it is just a conservative estimate, the manufacturer also probably expects that in most cases flash-blocks will be erased without then writing in all the flash-pages they can hold.

There is also the problem that as soon as one flash-block reaches the maximum erase count (which is presumably slightly different for each chip or even block), it becomes unusable and thus the effective capacity of the device is reduced by that block.

This is why the FTL of flash SSDs aims for wear leveling by mapping the logical device onto the flash chips in a circular queue (similar to the lower level of a log structured file-system).

But in addition the firmware of rotating storage and flash SSD devices reserve a chunk of storage capacity for logically replacing failed parts of the capacity, and this is called sparing for rotating storage devices and overprovisioning for flash SSDs.

In the case of flash SSDs despite wear-leveling some flash-blocks will reach their maximum erase count before others, and those will then be ignored and a flash-block from the overprovisioned reserve will be used in their stead. The overprovisioning for my consumer-grade flash SSD is around 7% of total capacity (enterprise-grade flash SSDs tend to have a lot more). This is typical of consumer level flash SSDs, because they use chips with capacities measured in powers of 2, but offer visible capacities measured in powers of 10, so they have 256 gibibytes of physical capacity, but 256 gigabytes of visible capacity, and since 256 gigabytes are about 238 gibibytes, there are around 18 gibibytes of invisible capacity which gets turned into 7% overprovisioning.
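
The percentage works out like this:

# 256GB of visible capacity expressed in GiB, compared to 256GiB of flash.
echo '256 * 10^9 / 2^30' | bc -l                       # ~238.4 GiB visible
echo '(256 - 256 * 10^9 / 2^30) / 256 * 100' | bc -l   # ~6.9% held back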

Overprovisioning is also used to make writes faster, by keeping a reserve of empty flash-blocks ready to accept a stream of writes without having to be erased on the fly, and this can make a big difference to write latency as shown in this blog post by a manufacturer, but this does not need to consume a lot of the reserve unless there are very long surges of random writes.

But if erase endurance is a worry (not in the case of my laptop storage unit) one can short-stroke a consumer flash SSD to greatly increase its erase endurance. This means leaving some part of the visible capacity unused, for example by only allocating half of the visible capacity to partitions containing filetrees.

This means that half of all flash-blocks would not be written to directly, and that since the whole visible capacity is part of a single circular erase queue for wear leveling, writes on the logically allocated half of the capacity would spread on all flash blocks, reducing the average erase cycle count per block to half of what it would be otherwise.

This is quite similar to short-stroking, but the effect is not to reduce the average travel time of the head as for rotating storage devices, but to reduce the average number of erases per flash-block; it also significantly improves the chances of finding an empty flash-block during a random-write surge, just like the overprovisioning built into the FTL.

Simply not filling more than some percentage of each file-tree is probably going to have much the same effect as only allocating an equivalent percentage of visible capacity to file-trees. But I suspect that it is best to just reduce the size of the file-tree, both because not filling it is a rather crude way of restricting allocated space, and because leaving a part of the visible capacity wholly untouched may help the FTL to do its mappings better.
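
As a concrete sketch of the allocation approach (device name and the 50% split are just examples, and of course this must be done before any data is put on the unit):

# Partition only the first half of the visible capacity and leave the
# rest untouched, as extra wear-leveling headroom for the FTL.
parted -s /dev/sdb mklabel msdos
parted -s /dev/sdb mkpart primary 1MiB 50%
mkfs.xfs /dev/sdb1      # the second half of the device is never written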

120109b Mon: The noisy seeks of a flash SSD?

While doing some RSYNC backups on my new flash SSD in my laptop I half noticed a strong clicking or buzzing sound at the same time as heavy seeking on the filetree (as was also evident from the storage activity light on the front of the laptop), and directly related and proportional to it.

Then I realized that it was impossible: I was sort of habituated to the noise because it was nearly identical to the noise made while seeking by a rotating storage device with a magnetic pickup arm going back and forth. But flash based SSDs don't have anything like that.

This greatly perplexed me and I did some web searches and something similar seems to have been noticed by other flash SSD users, which was indeed strange, so I decided to investigate. First I wanted to verify the source of the noise, and eventually I found that the noise was coming from the loudspeakers of the laptop. This was very perplexing itself because I had earphones plugged into its sound output jack and this should redirect sound to the earphones, as indeed I was able to verify.

Then I looked for some possible background dæmon that would report disk activity with a buzzing sound, or perhaps something in the laptop hardware that was designed to have the same effects, and while looking I unplugged the earphones to plug in some external loudspeakers, and the seek noise stopped, and the laptop loudspeakers became silent while nothing was plugged into the sound output jack. Plugging in the earphones (or external loudspeakers) made the seek noise reappear.

My conclusion is that this is because of imperfect isolation of parts of the sound circuits of the laptop from the electrical noise of the computing circuits: when the earphones are not plugged in, the loudspeakers are driven by the sound chip in the laptop, and when the earphones are plugged in that chip is connected to them, and perhaps then the loudspeakers are connected to ground or whatever and pick up computing circuits electrical noise, most likely bus noise, and that has some harmonics in the audible range. Amusingly these are also nearly identical to rotating storage device seek noise.

My impression is supported by previous experience: when using cheap sound cards in some desktop PCs I noticed that there was a characteristic background noise that was obviously related to CPU and memory activity.

Indeed it is somewhat daring to put a sound circuit in the same box and electrically connected to a lot of other electrically noisy electronics, and some people prefer to use external USB sound devices to minimize electrical couplings and reduce background noise (but poorly designed USB sound devices can pick up noise from the USB connection).

120109 Mon: The four drive home server and storage setup choices

I have a long list of parity RAID perversities to report, usually about overly wide, complicated setups, but there is a very recent one that in several different ways is typical and it is about just a narrow 2+2 RAID6 set:

I got a SMART error email yesterday from my home server with a 4 x 1Tb RAID6.

Why not a 4 drive RAID10? In general there are vanishingly few cases in which RAID6 makes sense, and in the 4 drive case a RAID10 makes even more sense than usual. Especially with the really cool setup options that MD RAID10 offers.

The main reason is the easy ability to grow the RAID6 to an extra drive when I need the space. I've just about allocated all of the array to various VMs and file storage. One thats full, its easier to add another 1Tb drive, grow the RAID, grow the PV and then either add more LVs or grow the ones that need it.

In this case, the raid6 can suffer the loss of any two drives and continue operating. Raid10 cannot, unless you give up more space for triple redundancy.

Basic trade-off: speed vs. safety.

The idea of a 2+2 RAID6 set seems crazy to me, not only because RAID6 is in almost every case a bad idea, based I think on a huge misconception about failure probabilities, but also because the alternatives are so much better, and I'll list them here:

RAID5

This seems to be one of the few cases where RAID5 is appropriate, because it is doubtful whether one needs that much redundancy for a home server, which is most likely going to be almost entirely read-only (typical home servers are used as media archives).

It also allows saving one drive that can go towards making offline backups of half of the content (for example with a cheap external eSATA cradle), and a 2+1 RAID5 would have much better rebuild times and most likely better (write) performance than a RAID6.

RAID10

This would have the same usable capacity, enormously better write performance independent of alignment requirements, much shorter rebuild times.
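
For reference, a minimal sketch of the kind of MD RAID10 set meant here; the f2 (far) layout is one of the cool MD options mentioned above, giving close to RAID0 sequential read rates on top of mirroring (device names are examples):

# 4-drive MD RAID10 with two far-spaced copies of each block.
mdadm --create /dev/md0 --level=10 --raid-devices=4 --layout=f2 \
      /dev/sd[abcd]1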

There is also the advantage of being able to continue working after the failure of any number of units, as long as no two of them are in the same pair, while RAID6 can continue working after the failure of up to 2 units. In this case it is not relevant though, as there are only 2 pairs of drives, so RAID6 seems to have the advantage, which arguably it does not in the general case.

RAID10 continues to have another large advantage though: that rebuild times are much shorter and less stressful on the hardware, because they involve just the remirroring of the affected pair(s), while rebuilding with RAID6 involves the rereading of all old units interleaved with the writing of the new units.

The latter is significant because the essential difference between RAID10 and RAID6 (or RAID5) is that in the latter all units are tangled together by the parity blocks. Having all drives active means that rebuilds are usually rather slower (for RAID6 also because parity computations are somewhat expensive), and as usually the drives are physically contiguous, a rebuild means a lot of extra load on the drive set and thus extra vibration, heat and power draw that often leads to additional failures. Something that also applies in general to writing with RAID6 (with RAID5 one can use the shortcut of reading just the block to be written and the parity block of any stripe).

In this case I think that RAID10 is still at least slightly better than RAID6 overall. That includes the idea of resizing the volume, because resizing RAID sets or even just filetrees is a pretty desperate move:

  • Resizing RAID sets involves reshaping stripes, with lots of reads and writes across all the units, and the reshaped stripes are likely to be far from optimal as to the performance layout of the contained filetree. It is a long operation that stresses the hardware like a rebuild after failure, only more so, and it is complicated and difficult.
  • Resizing a filetree contained on a RAID6 has performance implications because of the changed geometry of the underlying storage layer, which turns any RMW-optimized layout chosen by the file system code to that point into a rather badly performing one.
  • Resizing a filetree is often not a very complicated operation, often amounting to extending the free space inventory of the filetree, but it also leads to a lower performance even on a non-RAID storage layer because it skews the distribution of data and metadata in a timewise rather than spacewise direction, impairing locality of access.
2 RAID1 pairs

Even just 2 RAID1 pairs may be a good idea for a home server (or a production server) compared to a 2+2 RAID6, a 2+1 RAID5, or a 2×(1+1) RAID10, because it is a very simple layout with two fully independent areas.

Read performance may not be as high, because each area can only draw on one disk, but write rates are likely to be the same, and the advantage is not just simple setup, but also fully independent operation of the pairs. One of the two pairs can die entirely, and the other continues to work.

Among the benefits of simplicity is that by using just RAID1 it can be rather easy to boot fully from a pair, while it can be a bit more complicated with a non-RAID1 set.

2 independent pairs periodically cloned

All the previous setups have in my eyes a serious defect: they involve continuous redundancy, that is the members of a redundancy set are constantly in-use and subject to largely the same stress.

Often I prefer a hot+warm arrangement, where there are pairs of storage units, and one is online and hot, and gets periodically copied to the other member of the pair which is online and warm.

This can be done with drives that are both in the same enclosure, or with one of them outside it, and by block-by-block copying or temporary RAID1 remirroring.

The advantage of this arrangement is that, if internal, the backup drive can go into sleep mode most of the time, and in any case it is subject to a lot less load and different stress than a RAIDed one. Also, previous versions of files can be recovered from the backup drive.

Also, the idea of reshaping a RAID6 into a larger one and using LVM2 seems very risky to me, as it is a particularly stressful rebuild, and one that leaves the contained filetree misaligned as to RMW.

For my home server I use the hot-warm arrangement with 2 levels of backup: for every hot drive a warm backup drive in the PC box, which is in sleep mode almost all day, and gets a few hours of mirroring during each night by way of a CRON script, and one or more offline cold backup drives that get mirrored somewhat periodically in an eSATA external box.
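
As a rough sketch of that nightly job (device names are made up, and a real script should first make sure the filetrees on the hot drive are idle, unmounted or snapshotted):

#!/bin/sh
# Run from cron in the small hours, for example with a crontab line like
#   30 2 * * * root /usr/local/sbin/warm-mirror
dd if=/dev/sda6 of=/dev/sdb6 bs=1M     # block-by-block clone, hot -> warm
dd if=/dev/sda7 of=/dev/sdb7 bs=1M
hdparm -y /dev/sdb                     # send the warm drive back to standby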

For work I tend to have RAID1 pairs or triples for data that does not require high write rates, and RAID10 sets for data that does. Very rarely narrow (2+1 or 3+1) RAID5 sets in the few cases where they make sense, and even more rarely RAID0 sets for high transfer rate cases with volatile data. Never RAID6, unless upper management believes a storage salesman's facile promises and tells me to just put up with it.

As a rule I use MD instead of hardware RAID host adapters (most have amazingly buggy firmware or bizarre configuration limitations), and virtually never DM/LVM2, the exception being for snapshots (and I am very interested in file-systems that have built-in snapshotting, usually because they are based on COW implementations).

120108b Sun: Block device duplication transfer rates

Having mentioned the speed of treewise filetree level copies across disks on my home server, these are the speeds of linear block device copies on the same server, first those between 1TB disks and then those between 2TB disks, both being nightly image backups:

'/dev/sda1' to '/dev/sdb1': 25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 243.94 seconds, 110 MB/s 
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 243.943 seconds, 110 MB/s

real    4m3.944s 
user    0m0.071s
sys     0m18.332s
'/dev/sda3' to '/dev/sdb3': 25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 242.319 seconds, 111 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 242.322 seconds, 111 MB/s

real    4m2.405s 
user    0m0.072s
sys     0m18.429s
'/dev/sda6' to '/dev/sdb6': 95385+1 records in 
95385+1 records out 
100018946048 bytes (100 GB) copied, 917.785 seconds, 109 MB/s
95385+1 records in 
95385+1 records out 
100018946048 bytes (100 GB) copied, 917.791 seconds, 109 MB/s

real    15m17.792s
user    0m0.245s
sys     1m8.049s 
'/dev/sda7' to '/dev/sdb7': 85344+1 records in 
85344+1 records out 
89490522112 bytes (89 GB) copied, 845.122 seconds, 106 MB/s  
85344+1 records in 
85344+1 records out 
89490522112 bytes (89 GB) copied, 845.126 seconds, 106 MB/s  

real    14m5.798s
user    0m0.223s
sys     1m1.021s 
'/dev/sda8' to '/dev/sdb8': 237962+1 records in
237962+1 records out
249521897472 bytes (250 GB) copied, 2532.82 seconds, 98.5 MB/s
237962+1 records in
237962+1 records out
249521897472 bytes (250 GB) copied, 2532.83 seconds, 98.5 MB/s

real    42m14.171s 
user    0m0.707s
sys     2m49.092s
'/dev/sda9' to '/dev/sdb9': 475423+1 records in                                                 
475423+1 records out                                                                            
498517671936 bytes (499 GB) copied, 6585.2 seconds, 75.7 MB/s                                   
475423+1 records in                                                                             
475423+1 records out                                                                            
498517671936 bytes (499 GB) copied, 6585.2 seconds, 75.7 MB/s                                   
                                                                                                
real    109m47.678s                                                                             
user    0m1.322s                                                                                
sys     5m35.689s
'/dev/sdc1' to '/dev/sdd1': 25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.092 seconds, 130 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.095 seconds, 130 MB/s

real    3m26.782s
user    0m0.055s
sys     0m15.635s
'/dev/sdc3' to '/dev/sdd3': 25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.394 seconds, 130 MB/s
25603+1 records in
25603+1 records out
26847313920 bytes (27 GB) copied, 206.396 seconds, 130 MB/s

real    3m26.397s
user    0m0.057s
sys     0m15.830s
'/dev/sdc6' to '/dev/sdd6': 419195+1 records in
419195+1 records out
439558766592 bytes (440 GB) copied, 3507.74 seconds, 125 MB/s
419195+1 records in
419195+1 records out
439558766592 bytes (440 GB) copied, 3507.74 seconds, 125 MB/s

real    58m29.626s
user    0m0.938s
sys     4m17.484s
'/dev/sdc7' to '/dev/sdd7': 475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 4321.96 seconds, 115 MB/s
475423+1 records in
475423+1 records out
498517671936 bytes (499 GB) copied, 4321.96 seconds, 115 MB/s

real    72m3.533s
user    0m1.214s
sys     5m17.721s
'/dev/sdc8' to '/dev/sdd8': 950344+1 records in
950344+1 records out
996508860416 bytes (997 GB) copied, 11258.7 seconds, 88.5 MB/s
950344+1 records in
950344+1 records out
996508860416 bytes (997 GB) copied, 11258.7 seconds, 88.5 MB/s

real    187m44.220s
user    0m2.742s
sys     11m4.643s

The notable features are the satisfactory linear speeds, which are however not that much higher than the treewise ones, the expected decline towards the inner tracks, and the significant improvement over previous reports from April 2006, May 2007 and June 2009.
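
The copies themselves are plain dd runs; the doubled statistics above suggest a pipeline of two dd processes, one reading and one writing, with a 1MiB block size guessed from the record counts (this is a reconstruction, not the exact script used):

# One partition copy of the kind reported above.
printf "'/dev/sda1' to '/dev/sdb1': "
time { dd bs=1M if=/dev/sda1 | dd bs=1M of=/dev/sdb1; }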

120108 Sun: Good transfer rates on bulk filetree copy

After regretfully converting my laptop partitions from JFS to XFS I am converting my server's partitions too, by copying each from the JFS partition on the backup disk to an XFS partition on the live disk, and I have 2 live (and thus 2 backup) disks; this is the aggregate transfer rate for 2 concurrent filetree copies (on somewhat outer cylinders):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  3      0 102440    328 3693320    0    0 153876 158236 4766 14750  0 29 20 50  0
 0  3      0 102012    320 3693716    0    0 148364 139740 5176 15396  0 28 32 39  0
 4  5      0 113916    332 3681632    0    0 99772 110472 4129 10544  0 17 38 45  0
 1  3      0 103692    356 3691424    0    0 208552 199824 5817 19128  0 40 14 46  0
 1  2      0  96316    364 3699624    0    0 108808 100068 4715 11346  0 22 42 36  0
 1  2      0 110456    372 3685308    0    0 171912 168132 5656 16903  0 37 21 42  0
 1  4      0 120624    352 3675564    0    0 133116 117740 5057 14079  0 26 39 35  0
 2  3      0 103516    332 3692568    0    0 208896 203128 8105 22806  0 51 15 33  0
 2  3      0 114960    304 3681052    0    0 197512 193936 7624 22460  1 51 15 33  0
 3  3      0 110404    268 3685924    0    0 208128 201532 8046 22699  1 52 13 34  0
 3  3      0 105964    256 3690176    0    0 206720 209336 7977 22331  0 51 16 33  0
 2  3      0 107744    244 3688060    0    0 199320 186580 7673 21648  0 49 17 34  0
 5  4      0 102328    236 3693952    0    0 209024 203032 8005 23010  0 53 15 32  0
 3  3      0 103052    220 3693340    0    0 219392 214136 8617 23030  0 54 14 32  0
 3  1      0 111684    212 3684580    0    0 200960 194764 7551 23520  0 51 12 37  0
 2  4      0  98528    212 3697812    0    0 201600 197308 7527 21233  0 48 17 35  0
 0  5      0  97140    212 3699264    0    0 204928 202348 7776 23740  1 51 14 34  0
 4  4      0 118084    160 3678132    0    0 208128 182308 7838 24767  0 51 13 36  0
 3  5      0 106352    144 3690336    0    0 203904 168648 7402 24999  0 51 13 36  0
 6  4      0  97688    140 3697916    0    0 211708 166412 7688 26822  0 53 11 35  0
 0  5      0  95012    128 3698996    0    0 217092 174740 8022 28231  0 56 11 33  0

This is done with a tar -c -f - .... | tar -x -f - ... style pipeline in each case. The transfer rates are pretty good (in the period above I think the transfer was of relatively large files like photographs). This is obviously because both JFS and XFS perform quite well, and I sometimes (too infrequently) copy the filetrees in the same way to improve locality.
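
Spelled out, the pipeline has this general shape (the mount points and the blocking factor here are illustrative, not necessarily the exact values used above):

# Treewise copy from one mounted filetree to another, as one stream.
tar -c -b 256 -f - -C /fs/old . | tar -x -b 256 -f - -C /fs/new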

The fairly high system CPU time is due in part (around 10-15% of available CPU time) to the usual high CPU overheads of the Linux page cache, and in part (around 30-40% of available CPU time) to one of the filetrees involved being on encrypted block devices. While I understand that encryption is expensive, the percentage of time devoted to just managing the page cache is fairly notable.

A single copy on an unencrypted filetree on inner cylinders is running like this a bit later (while copying music tracks of a few MiB each):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2      0 101900    172 3713000    0    0 68616 59820 2620 7994  0  8 65 27  0
 1  2      0 103920    172 3710992    0    0 71148 73844 2792 8324  0 10 64 26  0
 1  3      0  96296    144 3718692    0    0 72692 65484 2717 7973  0  9 62 29  0
 0  1      0 107504    144 3707744    0    0 46888 52668 2361 4692  0  4 67 29  0
 0  1      0 105032    144 3710072    0    0 46712 50888 2340 4181  0  3 69 28  0
 0  2      0  98476    140 3716688    0    0 69372 69916 2707 7370  0  5 66 28  0
 1  2      0  98160    140 3716804    0    0 50496 48124 2340 5303  0  5 67 28  0
 2  1      0 104700    140 3710312    0    0 44828 45680 2230 4917  0  4 67 29  0
 1  1      0 105792    140 3709312    0    0 62788 55240 2550 6965  0  6 67 27  0
 0  2      0 107232    152 3708112    0    0 50300 56752 2445 5804  0  5 63 32  0
 0  1      0  96424    156 3718756    0    0 48420 46140 2303 4773  0  5 67 28  0
 1  1      0 100036    156 3715388    0    0 55696 56336 2456 5510  0  6 65 29  0
 0  3      0 102044    156 3713064    0    0 63980 53392 2557 7526  0  7 62 31  0
 0  2      0  94812    156 3720228    0    0 41696 47428 2253 4039  0  4 64 33  0
 0  2      0  97216    164 3717836    0    0 51868 58480 2544 5331  0  5 65 30  0
 0  1      0  98424    164 3716596    0    0 49960 50164 2329 5234  0  5 66 29  0
 0  1      0 103424    156 3711676    0    0 64468 67608 2698 7042  0  6 64 30  0
 0  3      0  96984    152 3718324    0    0 75164 75944 2857 8409  0  7 61 32  0

120105 Thu: Web hyperlinks becoming rarer, search engines, editorship

Fascinating article on Google getting gamed by SEO spammers, with a very important aspect that is mentioned in the article but otherwise underrated:

Summary: Google's early success was due to stellar search, which was based on its PageRank algorithm. Spammers have ruined PageRank, what now?

...

After all, hardly anyone links to anyone anymore, unless they’re spammers.

The really important aspect is that hyperlinks have been heavily discouraged by Google and the article above has an infographic that lists some detailed reasons for that. Broadly speaking however the reasons are:

Note: these are expanded from an earlier simplified argument in a previous post.

The really big problem is the last one: increasing what marketing experts call the stickiness of a site. The big social sites like Facebook, for example, heavily discourage outgoing hyperlinks because they would take away visitors, and thus provide a number of features that result in more hyperlinks within the site, among the many internal pages of their users.

Something similar applies to Wikipedia: they discourage outgoing hyperlinks and encourage internal cross references among Wikipedia pages, and I have just checked, and as the article above says, all external Wikipedia hyperlinks are tagged rel=nofollow (hopefully just to discourage hyperlink spammers from abusing Wikipedia pages).

In other words visitors are a desirable resource and their attention is hoarded by not publishing outgoing hyperlinks or by marking them rel=nofollow so that they don't contribute to another site's Google PageRank.

But note that there is a big difference between using rel=nofollow and not using hyperlinks at all: in the former case the source site visitor still benefits from the hyperlinks and the target site authors benefit from the incoming visitors, they just don't get the PageRank contribution. But this brings up some other aspect of the modern web:

Now before the conclusion a vital historical aspect: in the early days of the web, or even before the web in the early days of the Internet, it was very difficult to locate the content one was interested in, and therefore several directories were made listing types of content and their locations. Some of these were printed, and initially YAHOO! itself was merely a directory (as in the backronym Yet Another Hierarchically Organized Oracle!) as a Wayback Machine snapshot from 1996 shows clearly. There are two problems with directories (which still exist):

Given those two issues, directories became unsustainable with the growth of the web, and some people started applying standard textual database search methods to the web, that is, building directories algorithmically using techniques well developed for content like books or news. This usually failed because web content is largely unedited (that is, most of it is rubbish) and therefore difficult to search for relevance.

Note: this happens also in internal web sites and wiki sites (for example and famously Microsoft SharePoint installations) which usually are not afforded any funding for a human editor and therefore usually degenerate into piles of rubbish without any structure.

The idea by Google (and several others before them) was to build directories by figuring out which sites were well edited web directories for some topic or another, and using their structure as the basis for their algorithmically constructed directories. The final dialectic and irony is at this point finally clear:

Note: Put another way sites win by having incoming hyperlinks from other sites and principally Google, and lose by having well chosen outgoing hyperlinks, but at the same time Google relies on those well chosen outgoing hyperlinks to build the database of incoming hyperlinks to all sites.

The problem Google has is that they should be implicitly rewarding, instead of punishing, the web site authors who create well edited sites with good quality outgoing hyperlinks, and thus donate free editorship to Google, but they cannot do so because their business model is to steal outgoing hyperlinks from them. Perhaps they need to adapt their business model so that outgoing hyperlink revenue is either shared with the sites that contributed to their selection (nearly impossible), or so that sites with well chosen outgoing hyperlinks directly generate revenue from them, and that means hyperlinks embedded in the text of those sites, not separately as in Google AdSense.

For web visitors the big problem is that content that is not visible via web search is almost inaccessible, and content that is not reachable via a chain of hyperlinks (the more arduous alternative) is effectively inaccessible, and that the two are not independent, and because of that they are both becoming less useful.

The only bit of luck is that some authors still provide well edited sites with valuable outgoing hyperlinks, but those sites are difficult to find, and they tend not to last, or to be updated often, because providing well edited web sites is expensive.

Presumably the web will continue to fragment more into islands of highly connected sites with well edited content where there is some reason for such sites to be well edited other than personal dedication, or where personal dedication is motivated for the long term, performing the role of libraries, and web search like Google will become a lot less useful, as it devolves more and more into mindless advertising.

120104d Wed: UNIX with 30 users on a 16MiB PC with a 25MHz clock

I was doing some nostalgia searching on old PC technology and found an entertaining USENET thread from July 1989 on how easy it is to have 30 concurrent users on a 16MiB PC with a 25MHz CPU, and I feel like reproducing here one of the more revealing messages:

Newsgroups: comp.unix.i386
From: j...@specialix.co.uk (John Pettitt)
Date: 30 Jun 89 07:51:32 GMT
Local: Fri, Jun 30 1989 3:51 am
Subject: Re: How many users _really_ ?

hjesper...@trillium.waterloo.edu (Hans Jespersen) writes:
>In article <1989Jun27.074031.11...@specialix.co.uk> j...@specialix.co.uk (John Pettitt) writes:
>>How many 30+ user 386 systems do you know that run `dumb' cards ?
>Perhaps a better question is "How many 30+ user 386 systems do you
>know that run (period)."

Quite a large number.

>Apart from this reference I have seen/heard many people talking about
>putting 32 users on a '386 PC system. In my experience (although limited
>to 20 MHz, non-cached machines) it seems as though 32 is a unbelievably
>high number of users to put on a 386 system. The bus seems to be the
>bottleneck in most cases, specifically due to disk I/O. I know I'd
>feel much better proposing a 32 user mini based solution that a '386 based
>one.

Firstly a 20 Mhz non-cached 386 is not the place to start building a 32 user
system. Most of the big systems we see are based on 25 Mhz cached machines
like the top end Olivetti machines.

Secondly a large number of these systems run DPT or SCSI disks, this gives
a noticable improvment in performance.   The AT bus is only used by these
systems for I/O, memory has it's own bus so the throughput is not too bad.

Thirdly,  most of the user of this type of system are running commercial
accounting or office automation software.   The system I am typing this from
is a 25 Mhz Intel 302 and 4 engineers can kill it (well jeremyr kill it by
running VP/ix), but the office automation is a very different application.

On the 32 terminal system there will be between 8  and 24 active users most of
the time and of those only about half will be doing much more than reading
mail / enquiry access.

>How many CONCURRENT users ( of typical office automation software )
>can you respectfully support on a '386 PC ? I know this is vauge and
>will vary considerably depending on the configuration but anyone want
>to give it a shot ?

This is the real point - CONCURRENT users - my guess is with a good disk
an a good I/O board (like ours plug plug :-) and a well tuned system you
should be able to support about 16-20 wp users or about 8-10 spread sheet
users (need a 387 tho).

We have a number of customers with 32 and in one case 64 terminals on 386
systems and very usable perfomance levels.

Oh and I nearly forgot - you can _never_ have too much RAM.

>--
>Hans Jespersen
>hjesper...@trillium.waterloo.edu
>uunet!watmath!trillium!hjespersen

-- 
John Pettitt, Specialix, Giggs Hill Rd, Thames Ditton, Surrey, U.K., KT7 0TR
{backbone}!ukc!slxsys!jpp    jpp%slx...@uunet.uu.net     j...@specialix.co.uk
Tel: +44-1-941-2564       Fax: +44-1-941-4098         Telex: 918110 SPECIX G

Which reminds me that some years earlier I was able to have 3 users (barely) on a PDP-11/34 with 256KiB of memory running the UNIX V7 derivative 2.9BSD.

The vital qualification is that these were all text-mode users, with no GUI like X window system.

120104c Wed: Why SECURITY ERASE UNIT is an important feature and some storage history

Some of my earlier and recent posts mention support for the SECURITY ERASE UNIT command as an important feature of a SATA storage unit, as accessible for example via the --security-erase and related options of the ATA Security Feature Set support in the tool hdparm.
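
For reference, the usual hdparm sequence is a sketch like the following; it destroys all data on the unit, it fails if the drive reports itself as frozen (a suspend/resume cycle often unfreezes it), and the password is a throwaway value:

# Check the security state first: it should say "not frozen" and "not locked".
hdparm -I /dev/sdX | grep -A8 'Security:'
# Set a temporary password, then issue the erase (this wipes the unit).
hdparm --user-master u --security-set-pass Eins /dev/sdX
hdparm --user-master u --security-erase Eins /dev/sdX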

A minor reason is that having an erase feature built into the drive is a good idea, even if the security of such a feature is somewhat dubious compared to physical destruction of the storage unit.

The major reason is that it offers an opportunity to sort-of reformat the storage unit, which may help make it more reliable, or to solve minor damage issues.

Modern storage units cannot really be reformatted because nearly all of them have IDE, that is an embedded controller, which does not support reformatting. This is also because the physical recording layer is formatted often in a very particular way, with precise embedded servo signals and tracks, which can only be made by dedicated factory equipment, and not by the signal generators inside the IDE, which can only read them.

Ancient storage units did not have IDE, and thus received directly from a controller the signal to be recorded onto the medium. For IA (previously known as IBM compatible) PCs the first storage units had an ST506 signaling interface, which connected a controller card (usually a WD1003) in the PC to the rotating disk drive, the drive acting as a recorder and player of the analog data signal produced by the controller. It was even possible, by changing the controller, to change the recording scheme, for example from MFM to RLL 2,7 encoding, and I remember being lucky that my Miniscribe 3085s, nominally 80MB with a 400KB/s transfer rate (and 28ms average access time), would work with an ACB2372B controller, thus delivering instead 120MB at a 600KB/s transfer rate (thanks to being able to put 27 instead of 16 sectors in each track).

Currently controllers are integrated with the storage medium and recording hardware (on a PCB, usually on the bottom of the unit), and they are rather sophisticated computers, usually with a 16-bit or 32-bit CPU, dozens of MiB of RAM, a small real-time operating system, and hundreds of thousands or millions of lines of code implementing a high level protocol like the ATA or SCSI command sets (for SATA and SAS respectively); they do not offer any direct access to the underlying physical device.

The controller usually offers to the host an abstract, indirect view of the device as a randomly addressable array of logical sectors that get mapped in some way or another onto the physical recording medium (and the mapping can be quite complicated in the case of flash SSD devices).

Part of this mapping is sparing: the physical device has spare recording medium that is kept in reserve, and if some of the in-use medium fails, the logical sectors assigned to the damaged part get reassigned to a section of the spare recording medium.
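
As an aside, the extent to which sparing has already happened on a unit can usually be seen in its SMART attributes; a minimal sketch with smartmontools, where /dev/sdX is a placeholder and the attribute names vary a bit between vendors:

# smartctl -A /dev/sdX | grep -i -e reallocated -e pending   # spared sectors and sectors waiting to be spared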

This sparing is essential, as contemporary devices have hundreds of millions of logical sectors and the damage rate for the physical medium cannot be zero. Unfortunately it often does not work that well, for various reasons, for example:

For all these reasons it is often (but not always) possible to refresh a storage unit to look error-free and almost as if reformatted by performing a SECURITY ERASE UNIT style operation, as that usually triggers some firmware logic that looks remarkably like the initial setup of the drive after the recording medium has been primed at the factory.

The SECURITY ERASE UNIT command therefore usually does a bit more than simply erasing the content of all logical blocks: it effectively rescans and reanalyzes most or all of the physical recording medium, and rebuilds the internal tables that implement the logical device view on top of the physical medium.

This is fairly important for flash SSDs because it is in effect a whole-disk TRIM operation: it resets all the flash erase blocks to empty and clears all the logical-sector-to-flash-block assignments, thus usually bringing the whole unit back to nearly optimal performance, with minimal write amplification.
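
As an aside, something similar for just the unused space of a mounted filesystem can be requested on Linux with the fstrim utility from util-linux; a minimal sketch, with the mount point being a placeholder of mine:

# fstrim -v /srv/data        # ask the kernel to TRIM the free space and report how much was discarded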

Considering how quick the process is on a flash based SSD, and that SSDs are much faster than rotating storage devices at restoring archives, a complete backup/secure-erase/reload cycle can be affordably fast, and it completely refreshes both the logical and physical layouts of the storage unit (in effect defragmenting it).

120104b Wed: My new SSD misreports its geometry, plus other details

In my new XFS setup I have had to specify the logical sector size to XFS explicitly as 4096B, because my new SSD storage unit misreports its geometry, as can be seen in the hdparm -I output:

ATA device, with non-removable media
	Model Number:       M4-CT256M4SSD2                          
	Serial Number:      ____________________
	Firmware Revision:  0009    
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x0028) 
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Form Factor: 2.5 inch
	Nominal Media Rotation Rate: Solid State Device

Some details are notable, the first being that the drive reports a 512B logical sector size and also a 512B physical sector size. I would rather it reported a 4KiB logical sector size and a 1MiB physical sector size, but probably the 512B report for both is meant to avoid trouble with older operating system versions that cannot handle sector sizes other than 512B well or at all. But then those older operating systems usually do not query drives for sector sizes either. Continuing to look at the reported features:

Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Advanced power management level: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	   *	Advanced Power Management feature set
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	IDLE_IMMEDIATE with UNLOAD
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	unknown 76[3]
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	NCQ priority information
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT LBA Segment Access (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management determinate TRIM supported
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	2min for SECURITY ERASE UNIT. 2min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 500a075_________
	NAA		: 5
	IEEE OUI	: 00a075
	Unique ID	: _________
Checksum: correct

In the above it is nice to see that it supports SCT ERC and of course it supports ATA TRIM. It is also amusing to see that SECURITY ERASE UNIT is supposed to take only 2 minutes. My 1TB 3.5" rotating storage devices estimated it at 152 and 324 minutes, and the 500GB 2.5" rotating storage device that was in my laptop estimated it at 104 minutes. There is some advantage to bulk-erase times from large simultaneously-erased flash blocks.
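
Returning to the sector size point: what the Linux kernel itself has concluded about the unit can be double checked independently of hdparm; a minimal sketch, with the device name being a placeholder:

# blockdev --getss --getpbsz /dev/sda        # logical and physical sector sizes in bytes
# cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size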

120104 Wed: Switching from JFS to XFS on my data filetrees

While installing my new SSD storage unit in my laptop I have decided with much regret to switch from my favourite JFS to XFS as the main data file-system.

While I think that after JFS it is XFS that is the most convincing file-system for Linux, it has enough complications and pitfalls that I would really rather have stayed with JFS. But JFS is not being actively maintained, and various features that are good for SSDs have not been added to it.

I might currently have some time to help maintain it, but because of the extreme quality assurance demands on file systems I cannot really help much: it is vital for quality assurance to spend quite a bit of hardware and time on testing, as many file-system problems only manifest under load on large setups.

Also XFS is now actively supported (in particular as to large scale quality assurance) by Red Hat for EL, as they have bought a significant subset of the development team, and EL and derivatives are important in my work activities.

I have used this non-standard set of parameters to set up the filetrees:

# mkfs.xfs -f -s size=4096 -b size=4096 -d su=8k,sw=512 -i size=2048,attr=2 -l size=64m,su=256k -L sozan /dev/sda6
meta-data=/dev/sda6              isize=2048   agcount=4, agsize=6104672 blks
         =                       sectsz=4096  attr=2
data     =                       bsize=4096   blocks=24418688, imaxpct=25
         =                       sunit=2      swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=16384, version=2
         =                       sectsz=4096  sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The XFS developers insist that the defaults are usually fine, and indeed most of the above are defaults made explicit. The main differences are specifying 2048B inode records instead of the smaller default, and specifying alignment requirements, in particular marking logical sectors to be considered 4096B, but also requesting RAID-like alignments on a single storage unit. The hope is that this helps with aligning to flash page and flash block boundaries and does not otherwise cost much (the only cost I can see is a bit of space).
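
As a side note, when mounting such a filetree on an SSD I would also consider mount options along these lines; this is only a sketch of mine, with the mount point as a placeholder, and whether to use the discard option (on kernels where XFS supports it) or periodic fstrim runs instead is a matter of taste:

# mount -o noatime,discard LABEL=sozan /srv/sozan   # noatime avoids metadata writes on reads, discard issues TRIM on deletion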

I have begrudgingly kept the root filetree as ext3 (under an EL5 derivative) and ext4 (under ULTS10), as initrd based boot sequences usually tend to be fragile with file-systems other than the default.

120103 Tue: SYSLINUX issues and Grml

Since a new release of the very useful Grml live CD has come out, I have downloaded it and wanted to install it on one of my USB portable storage devices, in particular the combined 32-bit and 64-bit dual boot edition. That did not work smoothly, with various instances of the message:

Missing Operating System

because of some mistakes in setting up the SYSLINUX bootloader correctly.

It actually turns out that with recent versions of SYSLINUX one can use VFAT types throughout, so the partition type can be 0x0e, and the filetree in the partition can be formatted and mounted as FAT32.
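
For the record, the overall shape of such a setup is roughly as follows; this is only a sketch of mine, where /dev/sdb stands for the USB device and the location of mbr.bin varies between distributions:

# fdisk /dev/sdb                                    # set the partition type to 0x0e and mark it bootable
# mkfs.vfat -F 32 -n GRML /dev/sdb1                 # create the FAT32 filetree in the partition
# syslinux /dev/sdb1                                # install the SYSLINUX boot sector and ldlinux.sys
# dd if=/usr/lib/syslinux/mbr.bin of=/dev/sdb bs=440 count=1 conv=notrunc   # generic SYSLINUX MBR code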

The other big problem was that the SYSLINUX configuration files for the 32-bit plus 64-bit dual boot did not seem to work from the USB storage device, and did not seem finished either, and anyhow I found them messy and hard to understand. So I have updated them to work more robustly and to be cleaner, and I have in the process removed the graphical menus that are hardly useful. I have put the updated version here.

Since the USB storage device uses flash memory, its access times are very low, much lower than those of a rotating storage device like a CD-ROM or DVD-ROM, and accordingly the running live system seems much snappier, even if a small flash drive has only a chip or two and therefore its transfer rates are not high (around 20MB/s reading and 6MB/s writing on the one I got).
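
A quick way to get a rough figure for the sequential read rate of such a stick is a direct I/O read with dd; a minimal sketch, with /dev/sdb as a placeholder:

# dd if=/dev/sdb of=/dev/null bs=1M count=200 iflag=direct   # bypass the page cache and report the transfer rate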

I still usually prefer bootable CDs and DVDs, as they are read-only, cheap and easy to duplicate, but admittedly many systems currently have no CD or DVD drive yet do have USB ports (and while I have a USB DVD drive, it is somewhat bulky to carry around or to use on a stack of servers).

120102b Mon: AMD 8 CPU chip uses shared resources

I have been reading for the past few months some reviews of the new AMD FX-8150 chips with 8 CPUs, which are reported to perform less well than their Intel counterparts even if they scale better, and it was quite interesting to see that they are reported to share functional units between CPUs.

This is a bit of a cheat, as it has something of the flavour of Intel's hyperthreading. The difference is that in AMD's case only a few of the resources are shared between CPUs, while in Intel's case virtually all of them are (except the memory engines).

120102 Mon: Some quick tests of a recent 256GB SSD

I have recently bought a 256GB Crucial M4 flash SSD (firmware 009) for my laptop, and I have been curious about its performance profile. The laptop is an older model that does not support 6Gb/s SATA, so read rates are limited by the 3Gb/s maximum.

As expected the SSD unit is far less sensitive to read-ahead and transaction size settings (at least for reading) than a rotating storage unit, for example:

#  for N in 8 32 128 512 2048; do blockdev --setra "$N" /dev/sda; blockdev --flushbufs /dev/sda; dd bs=64k count=10000 if=/dev/sda of=/dev/zero; done
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 8.21186 s, 79.8 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 4.49033 s, 146 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.52406 s, 260 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.42923 s, 270 MB/s
10000+0 records in
10000+0 records out
655360000 bytes (655 MB) copied, 2.67782 s, 245 MB/s

and there are entirely equivalent results with O_DIRECT and varying transaction sizes:

#  for N in 8 32 128 512 2048; do dd bs="$N"b count=$[1200000 / $N] iflag=direct if=/dev/sda of=/dev/zero; done
150000+0 records in
150000+0 records out
614400000 bytes (614 MB) copied, 11.946 s, 51.4 MB/s
37500+0 records in
37500+0 records out
614400000 bytes (614 MB) copied, 4.78695 s, 128 MB/s
9375+0 records in
9375+0 records out
614400000 bytes (614 MB) copied, 3.33459 s, 184 MB/s
2343+0 records in
2343+0 records out
614203392 bytes (614 MB) copied, 2.7135 s, 226 MB/s
585+0 records in
585+0 records out
613416960 bytes (613 MB) copied, 2.58665 s, 237 MB/s

In both cases, a 64KiB read-ahead or 64KiB read size deliver near the top transfer rate, thanks to the very low latency due to negligible access times. For the same reason fsck (of an undamaged filetree) is 10-20 times faster on the SSD than on the same filetree on rotating storage.

120101b Sun: An analysis of Linux file-system sources

Just discovered that someone has run several code analysis tools over the Linux file-systems sources.

The main interest is in the many ways of looking at inter-module references. I haven't spotted anything notable yet.

120101 Sun: How flash memory really works

In my previous notes about how SSDs are structured I simplified a bit what is a complex picture. One of the simplifications is that I wrote that flash memory must be erased before it is written, and that is not quite the case: flash memory bits cannot be rewritten arbitrarily, they can only be set, that is one can only logical-OR new data bitwise into a flash page; erasing simply resets all the bits in a flash block to the unset state.

In theory this property of flash memory can be used to minimize read-erase-write cycles, because if the new content of a flash page differs from the current content only by setting bits that are currently unset, it can be written directly without an erase; but I do not know whether flash drive firmware checks for that.
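
As a toy illustration of the check that firmware would have to do, here is a sketch of mine in shell arithmetic with 8-bit page contents: the new content can be ORed in place exactly when every bit already set in the old content is also set in the new one.

# old=$(( 2#10100000 )); new=$(( 2#10110001 ))
# if (( (old & ~new & 0xff) == 0 )); then echo 'can write in place'; else echo 'needs an erase first'; fi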