Software and hardware annotations 2011 March

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


110331 Thu Long storage device error retries

There are several difficulties with storage setups, and many of them arise from unwise attempts to be too clever. One of these is the attempt by drive manufacturers to compensate for MS-Windows issues and for inflated expectations by customers.

One such expectation is that storage devices are faultless and never lose data or capacity, so that making redundant copies is quite unnecessary. As a result most storage devices are set to retry failed operations for a long time in the hope of eventually forcing them through, and amazingly this applies to writes too, as many block layers and file system designs cannot easily mark some portion of a storage area as faulty and then use another one which is not faulty.

The real, fallible storage device is then virtualized into a (pretend) infallible one, which however has rather different properties from real storage devices, in particular as to latency: higher latencies when accessing a sector that has been virtualized (usually because of a fault) to another location, and enormously higher latencies while a failing operation is being retried.

Some parts of Linux have been designed by people who are used to storage being unvirtualized, and as a result they do their own retries of failing operations. Since failing operations are often clustered in time (for example a cable becomes faulty) or in space (for example a small part of the surface of a platter becomes faulty), this can involve very long periods in which the system becomes unresponsive. This can be dozens of minutes, as each failing operation is first retried for 1 or 2 minutes by the storage device and then that is repeated several times by the device driver or the block layer in the kernel, or even by overeager applications.
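To make the "dozens of minutes" figure concrete, here is a rough sketch of how the retries multiply when each layer simply repeats what the layer below already tried. The function name and the specific counts (5 attempts by the upper layers, 3 failing operations in a row) are illustrative assumptions, not measurements; only the 1 to 2 minute device retry time comes from the text above.

```python
def worst_case_stall_seconds(device_retry_s, upper_layer_attempts, failing_ops):
    """Rough upper bound on the stall when every layer retries in sequence."""
    return device_retry_s * upper_layer_attempts * failing_ops


# Assumed figures: 90 s of retries inside the drive, 5 attempts by the
# driver or block layer, and 3 neighbouring bad sectors hit one after another.
stall = worst_case_stall_seconds(90, 5, 3)
print(f"{stall} s, about {stall / 60:.1f} minutes")
```

With those assumed numbers the stall is already over 20 minutes, and a larger cluster of bad sectors scales it up linearly.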

The far better option overall is to use redundancy rather than retries to cope with failures, and to acknowledge failure early, rather than over-relying on crude attempts at data recovery.

This particularly matters if further virtualization layers are used; for example a hardware RAID HA, or a software virtual HA in a VM.

These layers often have their own fault recovery logic, and lower level recovery just makes things take longer. Even worse, they may have their own IO operation timeouts which may be triggered by long retry times in lower levels of access; for example a timeout on a single operation might result in some layer considering the whole device as faulty, when instead it is almost entirely working well.

One of the major issues is the very long retries done by most storage devices in an attempt to work around the limitations as to error handling in MS-Windows.

Fortunately an extension to some common storage protocols has somewhat recently become fairly popular: ERC (Error Recovery Control), part of the SCT portions of SATA/SAS. This is somewhat limited, as changes to the retry settings are not permanent, and as a rule the permanent defaults can only be changed in a few cases.
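As an illustration of how the ERC limits are commonly set in practice, the sketch below builds a smartctl command line (from smartmontools, a tool the text above does not name, so treat its use here as my assumption about a suitable utility). The helper scterc_command is hypothetical and /dev/sdX is a placeholder; the command is only printed, not run. smartctl expects the read and write limits in tenths of a second.

```python
def scterc_command(device, read_seconds, write_seconds):
    """Build (but do not run) a smartctl call that sets the SCT ERC limits."""
    # smartctl takes the two limits in tenths of a second.
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# /dev/sdX is a placeholder device name.
print(" ".join(scterc_command("/dev/sdX", 2, 2)))
```

Since the setting does not survive a power cycle, a command like this would have to be reissued at every boot, for example from a startup script.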

SAS and enterprise grade SATA drives have reduced retry timeouts by default, but these are still often pretty long being typically 7 seconds. Some SATA consumer grade drives have had their SCT ERC settings blocked to prevent them being substituted for the far more expensive enterprise level drives (where the different SCT ERC default is the only functional difference, the others being quality of manufacturing ones that are difficult to demonstrate).

What is a suitable retry timeout is then a good question, and many SAS and enterprise grade drives have it set to 7 seconds, which seems way too long to me. It is surely more sensible than the 1-2 minutes of many consumer grade SATA drives, but still incredibly long: repeated read or write operations to the same area of the disk will usually incur only rotational latency (transfer time being negligible), and a typical 7200RPM drive will do 120 revolutions per second, for perhaps around 100 retries per second. I think then that 1-2 seconds are plenty for a retry timeout, especially on storage systems with redundancy built in like most RAID setups, but also on desktop or laptop drives.
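The arithmetic behind that estimate can be sketched as follows, under the simplifying assumption that a drive gets at most one attempt at a given sector per revolution. The function name attempts_within is just for illustration.

```python
def attempts_within(timeout_s, rpm=7200):
    """Upper bound on retries of one sector: one attempt per revolution."""
    revolutions_per_second = rpm / 60
    return revolutions_per_second * timeout_s


for label, timeout_s in [("1 s", 1), ("2 s", 2), ("7 s", 7), ("2 min", 120)]:
    print(f"{label:>5}: up to {attempts_within(timeout_s):.0f} attempts")
```

Even a 1 second timeout leaves room for over a hundred attempts, which is the basis for regarding 7 seconds, let alone 1-2 minutes, as excessive.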

110306 Sun MBR style partition alignment script

Regrettably there are now quite a few high capacity drives with 4096-byte physical sectors, and this requires better alignment of data and filesystems on those drives than in the past. Large alignment granules are also suggested or required by other storage technologies, from parity RAID (not a good idea except in a few cases) to flash-memory based drives that have huge erase-blocks.

The ideal technology to achieve this is the new GPT partitioning scheme, implemented notably by recent versions of parted and gdisk, which by default align partitions to largish boundaries like 1MiB. I prefer to set the granule of alignment and allocation to 1GiB for various reasons.

However there is still a case for using the old MBR partitioning scheme on drives with less than 2TB capacity, for example because most versions of GNU GRUB don't work with the GPT scheme.

In the MBR scheme there are several awkward legacy issues, mostly that:

The first primary partition and each logical partition have some metadata preceding them and by default this means that there is a 63×512 bytes offset from the beginning of the disk or the preceding partition.
Partition boundaries are described both in CHS and LBA terms, and the CHS address is the canonical and more intuitive one, while the one that matters but is a bit more awkward (large numbers) is the LBA one.
CHS addresses are expressed in virtual cylinders of 255×63 sectors of 512 bytes, and that is an inconvenient 16,065 sectors.
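The relation between the two addressing schemes can be sketched with the standard CHS/LBA conversion, assuming the conventional virtual geometry of 255 heads and 63 sectors per track described above. The function names are illustrative.

```python
HEADS = 255
SECTORS_PER_TRACK = 63
SECTORS_PER_CYLINDER = HEADS * SECTORS_PER_TRACK  # 16,065 sectors


def chs_to_lba(cylinder, head, sector):
    """Cylinder and head count from 0, sector counts from 1."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)


def lba_to_chs(lba):
    """Inverse of chs_to_lba for the same virtual geometry."""
    cylinder, rest = divmod(lba, SECTORS_PER_CYLINDER)
    head, sector0 = divmod(rest, SECTORS_PER_TRACK)
    return cylinder, head, sector0 + 1


print(SECTORS_PER_CYLINDER)   # sectors in one virtual cylinder
print(chs_to_lba(0, 1, 1))    # first sector after the default 63-sector offset
print(lba_to_chs(1028160))    # start of the 65th cylinder
```

Note that fdisk numbers cylinders from 1 in its listings, so cylinder 64 in these 0-based terms is shown as cylinder 65.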

The way I have chosen after consulting various sources to reconcile using traditional MBR partitions with modern storage technology is to adopt these conventions:

Align all partition beginnings (and thus partition sizes) to a power-of-2 multiple of the default CHS cylinder size, in particular 64 cylinders, which gives a 32KiB alignment.
Use the c setting of fdisk and the extended command b to set the LBA starting address of each partition to another, coarser power-of-2 alignment of 512 sectors, which gives a 256KiB alignment.
Write a small Perl script that given partition sizes in GiB computes suitable partition start and end points in CHS addresses, and LBA starting points.
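The two alignment figures in the conventions above can be checked with a few lines of arithmetic. This is only a sketch of the reasoning, separate from the script itself; align_up is an illustrative helper.

```python
SECTOR_BYTES = 512
CYLINDER_SECTORS = 255 * 63
KIB = 1024


def align_up(value, granule):
    """Round value up to the next multiple of granule."""
    return -(-value // granule) * granule


# 64 cylinders in bytes: a multiple of 32KiB, but not of 64KiB.
granule_bytes = 64 * CYLINDER_SECTORS * SECTOR_BYTES
print(granule_bytes % (32 * KIB))
print(granule_bytes % (64 * KIB))

# A data start rounded up to 512 sectors lands on a 256KiB boundary.
boundary = 64 * CYLINDER_SECTORS      # first sector of the 65th cylinder
data_start = align_up(boundary, 512)
print(data_start, (data_start * SECTOR_BYTES) % (256 * KIB))
```

The cylinder size of 16,065 sectors is odd, so a multiple of 64 cylinders can never give more than 64 sectors (32KiB) of alignment by itself; that is why the data start needs the separate, coarser adjustment.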

The very draft version of the script is:

#!/usr/bin/perl

use strict;
use integer;

my $KiB		= 1024;
my $MiB		= 1024*$KiB;
my $GiB		= 1024*$MiB;

my $SectB	= 512;
my $HeadB	= 63*$SectB;
my $CylB	= 255*$HeadB;

my $alPartB	= 32*$KiB;
my $alFsysB	= 256*$KiB;

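# Bytes 0-511 of the disk hold the MBR. The running start offset is
# declared and set to 0 just before the main loop below; the reserved
# area ahead of the first partition's data keeps it clear of the MBR.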

# Now we have as parameters either <= 4 primary partition sizes
# or >=5 sizes of which the first 3 are primary partitions and
# the rest are logical partitions. For each we want to print
# the first and last cylinders and sectors, and the start of data
# within it, given a specific *usable* size (that is, excluding
# the start of data alignment).

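# Takes (start offset in bytes, reserved bytes, usable size in bytes);
# returns (0-based start cylinder, LBA sector where data starts, end cylinder).
sub partCalc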
{
  my ($startB,$resB,$sizeB) = @_;

  my $startC = (($startB+$CylB-1)/$CylB);

  $startC += 1
    while (($startC*$CylB) % $alPartB != 0);

  my $startDataB = (($startC*$CylB+$resB+$alFsysB-1)/$alFsysB)*$alFsysB;

  my $endC = ($startDataB+$sizeB+$CylB-1)/$CylB;

  $endC += 1
    while (($endC*$CylB) % $alPartB != 0
	  && (($endC-$startC+1)*$CylB) >= $sizeB);

  return ($startC,$startDataB/$SectB,$endC);
}

my $startB = 0;

my $ARGC = 1;

foreach my $ARG (@ARGV)
{
  if ($ARG eq '')
  {
    if ($ARGC <= 4)
    {
      printf "%2d:\n",$ARGC;
    }
    else
    {
      printf "%2d: %s\n",$ARGC,"(logical partition cannot be void)";
    }
  }
  else
  {
    my $resB = ($ARGC == 1 || $ARGC >= 5) ? 63*$SectB : 0;

    my ($startC,$startDataS,$endC) = &partCalc($startB,$resB,($ARG+0)*$GiB);
    printf "%2d: %6dc to %6dc (%6dc) start %10ds\n",
      $ARGC,$startC+1,$endC,($endC-$startC),$startDataS;
    $startB = $endC*$CylB;
  }
  $ARGC++;
}

This is how I used it when partitioning a 2TB drive I recently got:

tree%  perl calcAlignedParts.pl 0 25 25 '' 5 409 464 928
 1:      1c to     64c (    64c) start        512s
 2:     65c to   3328c (  3264c) start    1028608s
 3:   3329c to   6592c (  3264c) start   53464576s
 4:
 5:   6593c to   7296c (   704c) start  105900544s
 6:   7297c to  60736c ( 53440c) start  117210624s
 7:  60737c to 121344c ( 60608c) start  975724032s
 8: 121345c to 242496c (121152c) start 1949391872s

and this is the resulting partitioning scheme in CHS and LBA terms:

% fdisk -l /dev/sdc

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00084029

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1              65        3328    26218080    7  HPFS/NTFS
/dev/sdc2            3329        6592    26218080    7  HPFS/NTFS
/dev/sdc4            6593      242496  1894898880    5  Extended
/dev/sdc5            6593        7296     5654848   82  Linux swap / Solaris
/dev/sdc6            7297       60736   429256608    7  HPFS/NTFS
/dev/sdc7           60737      121344   486833664   83  Linux
/dev/sdc8          121345      242496   973153184   83  Linux
% fdisk -l -u /dev/sdc

Disk /dev/sdc: 2000.3 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00084029

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1         1028160    53464319    26218080    7  HPFS/NTFS
/dev/sdc2        53464320   105900479    26218080    7  HPFS/NTFS
/dev/sdc4       105900480  3895698239  1894898880    5  Extended
/dev/sdc5       105900544   117210239     5654848   82  Linux swap / Solaris
/dev/sdc6       117210624   975723839   429256608    7  HPFS/NTFS
/dev/sdc7       975724032  1949391359   486833664   83  Linux
/dev/sdc8      1949391872  3895698239   973153184   83  Linux

This was achieved by first using fdisk to create the partitions with the usual n command with the given CHS start and end (after setting the c option just in case), and then adjusting the LBA starting sector of each partition with the b extended command.
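As a sanity check, the start sectors from the fdisk -l -u listing above can be tested for alignment. The sketch below reports, for each partition, the largest power-of-2 byte boundary its start falls on; the helper alignment_kib is illustrative and the start sectors are copied from that listing.

```python
SECTOR_BYTES = 512

# Start sectors copied from the "fdisk -l -u /dev/sdc" listing.
start_sectors = {
    "sdc1": 1028160,
    "sdc2": 53464320,
    "sdc4": 105900480,
    "sdc5": 105900544,
    "sdc6": 117210624,
    "sdc7": 975724032,
    "sdc8": 1949391872,
}


def alignment_kib(start_sector):
    """Largest power-of-2 byte boundary the sector falls on, in KiB."""
    offset = start_sector * SECTOR_BYTES
    # offset & -offset isolates the lowest set bit of the byte offset.
    return (offset & -offset) // 1024


for device, start in start_sectors.items():
    print(f"{device}: aligned to {alignment_kib(start)}KiB")
```

In this listing every partition start is at least 32KiB aligned, and the logical partitions sdc5 to sdc8 are at least 256KiB aligned.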