Computing notes 2012 part three

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


121019 Fri: The two different models of X Windows

It may seem incredible, but even quite smart people have been baffled by the inanity of the changes made to the X Window System architecture to introduce the so-called RANDR extension, in part because they are a radical change, in part because they have been rather objectionable and went through a series of poorly designed and incomplete iterations.

As background, it used to be that the X architecture was fairly clear:

An X server process would define a display, usually but not necessarily display :0, and this was configured in a Layout as a set of input devices and a set of screens, that is output devices.
In some cases, like Linux, which allowed multiple servers to run concurrently on different virtual consoles, a system could have multiple displays offering completely different setups.
A screen would have a device and a monitor type for the output device associated with the device, and a set of properties, including the bit depth and bit size of the frame buffer, and the supported video modes (the intersection of those supported by the device and the monitor).
Programs would access a screen by getting a graphic context for it, which would define what kind of imaging could be done with it, such as bit depth, pixel size, dots per inch.
In the case of multiple graphics cards capable of displaying multiple frames through multiple monitors one could define multiple screens.
A device represents a frame buffer containing a frame to display on the monitor with the associated type.
In the case of graphics units capable of defining multiple frame buffers with multiple rasterizers leading to multiple outputs multiple devices could have the same BusID to indicate that they were meant to be on the same graphics unit.
A monitor, or more precisely a monitor type, defines the characteristics of a monitor attached to a device. The main purpose of a monitor was to define the optical properties of the displayed frame, such as its size in millimeters and its pixel density, as well as its electronic properties, most notably the maximum bandwidth of its electronics and, for CRT monitors with a moving electron beam, the pause times those electronics required.

I have put several examples of X11 configuration elsewhere on this site, and here are the outlines of three common cases:

Single frame buffer, single monitor

Section		"Monitor"
  Identifier	"TypeA"
  VendorName		"generic"
  ModelName		"LCD 23in"
  Gamma			2.2
  DisplaySize		509 286
  HorizSync		30-83
  VertRefresh		56-75
  #Bandwidth            155
EndSection

Section		"Device"
  Identifier	"Card0"
  Driver	"vesa"
EndSection

Section		"Screen"
  Identifier	"Screen0"
  Monitor		"TypeA"
  Device		"Card0"
  Subsection		"Display"
    Modes		"1920x1080" "1366x768" "1024x768"
  EndSubsection
EndSection

Section		"ServerLayout"
  Identifier	"Generic"
  Screen		0 "Screen0"
  InputDevice		"Mice"		"CorePointer"
  InputDevice		"Keyboards"	"CoreKeyboard"
EndSection

Two frame buffers, two graphics units, two monitors of the same type.

Section		"Monitor"
  Identifier	"TypeA"
  VendorName		"generic"
  ModelName		"LCD 19in"
  Gamma			2.2
  DisplaySize		376 301
  HorizSync		31-81
  VertRefresh		56-75
  #Bandwidth            155
EndSection

Section		"Device"
  Identifier	"Card0"
  VendorName		"generic"
  BoardName		"GeForce"
  BusID			"PCI:1:0:0"
  Driver		"nvidia"
  Screen		0
EndSection
Section		"Device"
  Identifier	"Card1"
  VendorName		"generic"
  BoardName		"Radeon"
  BusID			"PCI:2:0:0"
  Driver		"radeon"
  Screen		1
EndSection

Section		"Screen"
  Identifier	"Screen0"
  Device		"Card0"
  Monitor		"TypeA"
  Subsection		"Display"
    Modes		"1280x1024" "1024x768"
  EndSubsection
EndSection
Section		"Screen"
  Identifier	"Screen1"
  Device		"Card1"
  Monitor		"TypeA"
  Subsection		"Display"
    Modes		"1280x1024" "1024x768"
  EndSubsection
EndSection

Section		"ServerLayout"
  Identifier	"Layout2"
  Screen		0 "Screen0"
  Screen		1 "Screen1"	RightOf "Screen0"
  InputDevice		"Mice"		"CorePointer"
  InputDevice		"Keyboards"	"CoreKeyboard"
EndSection

Two frame buffers, single graphics unit, two monitors of very different types

Section		"Monitor"
  Identifier	"TypeA"
  VendorName		"generic"
  ModelName		"LCD 13in"
  Gamma			2.2
  DisplaySize		286 178
  HorizSync		50
  VertRefresh		60
  #Bandwidth            83
EndSection
Section		"Monitor"
  Identifier	"TypeB"
  VendorName		"generic"
  ModelName		"LCD 24in or projector"
  Gamma			2.2
  DisplaySize		518 324
  HorizSync		24-94
  VertRefresh		48-85
  #Bandwidth            250
EndSection

Section		"Device"
  Identifier	"Card0Fb0"
  VendorName		"generic"
  BoardName		"nVidia"
  BusID			"PCI:1:0:0"
  Driver		"nvidia"
  Screen		0
EndSection
Section		"Device"
  Identifier	"Card0Fb1"
  VendorName		"generic"
  BoardName		"nVidia"
  BusID			"PCI:1:0:0"
  Driver		"nvidia"
  Screen		1
EndSection

Section		"Screen"
  Identifier	"Screen0"
  Device		"Card0Fb0"
  Monitor		"TypeA"
  Subsection		"Display"
    Modes		"1366x768" "1280x800" "1024x768"
    Depth		16
  EndSubsection
EndSection
Section		"Screen"
  Identifier	"Screen1"
  Device		"Card0Fb1"
  Monitor		"TypeB"
  Subsection		"Display"
    Modes		"1920x1200" "1280x1024" "1024x768" "800x600"
    Depth		24
  EndSubsection
EndSection

Section		"ServerLayout"
  Identifier	"Layout01"
  Screen		0 "Screen0"
  Screen		1 "Screen1"	RightOf "Screen0"
  InputDevice		"Mice"		"CorePointer"
  InputDevice		"Keyboards"	"CoreKeyboard"
EndSection

Originally each screen could only share the input devices with other screens, and windows could not be displayed across two screens nor could they be moved from one screen to another, in part because multiple-screen systems were rare and expensive, in part because screens could have extremely different characteristics, such as color depth or pixel density, for example a monochrome portrait monitor and a color television.

However as the memory sizes of graphics units increased, and the cost and size of monitors decreased, especially with LCD monitors, systems with two (or more) identical (or nearly identical) monitors became common, and so did the desire to be able to regard multiple monitors as interchangeable tiles.

Therefore a somewhat hacked solution was added, in the form of the XINERAMA protocol extension and associated mechanism. The protocol extension allowed applications to query the X server as to the geometry of the various screens and to handle them as if they were sections of a bigger meta-screen, with the positions of the screens within it to be specified in the X server's Layout section where they are listed.

It was a somewhat inelegant retrofit, putting the burden of dealing with the situation on applications, but since the main application code involved was in window managers and libraries rather than in end user code, it was mostly painless, and respected the overall successful philosophy of the X Window System to offer simple mechanisms and leave policies to user applications.

The above architecture was very flexible in many ways, in particular allowing very diverse devices and monitors to coexist in a display, but had a limitation: all the elements above, including the number and type of monitors and devices, and their characteristics, had to be statically defined in the X server configuration.

My solution to this was simply to define the maximum number of devices and monitors that I wanted to use in the worst case, and the list of screen modes that encompassed most of the actual devices and monitors I would be using, and for out-of-the-ordinary cases just run a custom-configured X server as a separate display on another virtual console.

Otherwise the X server implementation of Xinerama could have been reworked to support enabling and disabling devices and monitors, and adding and deleting modes.

The one limitation that could not be overcome in this way was the lack of screen rotation, as that requires additional code to rotate the frame buffer.

Someone then decided to add another protocol extension to request screen rotation, and to overload it with dynamic screen addition and removal and resizing. However for whatever stupid reason they decided to add to rotating and resizing a completely new model which was whacked into the X server to coexist uneasily with the previous one, and which is rather uglier:

A display is made of a number of autodiscovered input devices, which cannot be explicitly configured, and exactly one screen with a frame of 8192×8192 pixels.
A screen can only be supported by one device with exactly one frame buffer.
The one device on a single graphics unit could have multiple regions mapped to outputs by way of crtcs.
Those outputs have monitor instances attached to them, can be enabled and disabled dynamically, and their position, pixel density, gamma, and the characteristics of the monitor can be changed dynamically.
In the initial configuration file the positions of the outputs are defined in the monitor parameters, with respect to other outputs.

A sample static configuration with two identical output monitors could look like:

Section		"Monitor"
  Identifier	"Monitor0"
  VendorName		"generic"
  ModelName		"LCD 19in"
  Gamma			2.2
  DisplaySize		376 301
  HorizSync		31-81
  VertRefresh		56-75
  Option		"Primary"		"true"
  Option		"PreferredMode"		"1280x1024"
EndSection
Section		"Monitor"
  Identifier	"Monitor1"
  VendorName		"generic"
  ModelName		"LCD 19in"
  Gamma			2.2
  DisplaySize		376 301
  HorizSync		31-81
  VertRefresh		56-75
  Option		"Primary"		"false"
  Option		"PreferredMode"		"1280x1024"
  Option		"Right-Of"		"DVI1"
EndSection

Section		"Device"
  Identifier	"CardR"
  VendorName		"generic"
  BoardName		"generic"
  Option		"Monitor-DVI1"		"Monitor0"
  Option		"Monitor-VGA1"		"Monitor1"
EndSection

Section		"Screen"
  Identifier	"ScreenR"
  Device		"CardR"
  # 'Monitor' line in RANDR mode ignored.
EndSection

Section		"ServerLayout"
  Identifier	"LayoutR"
  Screen		"ScreenR"
  # 'InputDevice' lines in recent servers ignored.
EndSection

The equivalent dynamic configuration could be achieved with:

xrandr --newmode 1280x1024@60 108.0 \
  1280 1328 1440 1688 1024 1025 1028 1066 +HSync +VSync
xrandr \
  --addmode DVI1 1280x1024@60 \
  --addmode VGA1 1280x1024@60
xrandr \
  --output DVI1 --primary   --mode 1280x1024@60 \
  --output VGA1 --noprimary --mode 1280x1024@60 --right-of DVI1
xrandr \
  --output DVI1 --dpi 100 \
  --output VGA1 --dpi 100

The above only applies to relatively recent versions of RANDR, versions 1.2 and 1.3; previous versions are hardly usable except in narrow circumstances.

The static configuration is appallingly designed with the particularly silly idea of putting the geometry relationship among the outputs in the Monitor sections.

Probably RANDR is inspired by nVidia's TwinView which however is very much better designed, and is compatible with the old style X architecture.

121014 Sun: Using two ports for the same protocol

Because of the vagaries of computing history some important network protocols are assigned a fixed port number, but are used for very different types of traffic.

In particular application protocols like SSH and HTTP are often used as if they were basic transport protocols like UDP or TCP, with other protocols layered on top, often to help with crossing network boundaries where NAT or firewalls block transport protocols.

So for example SSH is used both for the interactive sessions for which it was designed, and for bulk data transfer for example with RSYNC.

This poses a problem in that the latency and throughput profiles of the protocols layered on top of SSH and HTTP can be very different, making it difficult for traffic shaping configurators like my sabishape to classify traffic correctly.

There is one way to make traffic shaping able to distinguish the different profiles of traffic borne by the same application protocol, and it is to assign to them different ports, as if they were different application protocols.

For example to use port 22 for interactive SSH traffic, but port 522 for RSYNC-over-SSH traffic. Similarly to use port 80 for interactive HTTP browsing, but port 491 for downloads.

Some of these ports are not NAT'ed or open by default in firewalls, and it is a bit sad to have to have independent local conventions but the benefit, especially avoiding the huge latency impact of bulk traffic on interactive traffic, is often substantial, and the cost is often very small, as many server dæmons can easily listen on two different ports for connections.
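
A minimal sketch of such a local convention as a POSIX shell function. The function name and the traffic class names are made up for illustration; the port numbers are the ones suggested above, and the servers are assumed to be already listening on both ports (for OpenSSH that would be two Port lines in sshd_config).

```shell
# Map a (hypothetical) traffic class name to the port reserved for it
# by the local convention described above.
port_for_traffic() {
  case $1 in
    ssh-interactive)  echo 22 ;;
    ssh-bulk)         echo 522 ;;
    http-interactive) echo 80 ;;
    http-bulk)        echo 491 ;;
    *)                return 1 ;;   # unknown class: no port assigned
  esac
}
```

A bulk transfer could then be started with something like rsync -a -e "ssh -p $(port_for_traffic ssh-bulk)" src/ host:dst/ so that a traffic shaper can classify it by port alone.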

120728 Sat: Random number generators and virtual machines

Having read my note about SSL issues and random number generators a smart correspondent has sent me an email to point out that such problems, and in general time-dependent problems, are made much worse by running application code, in particular SSL but not only that, inside virtual machines.

Virtual machines disconnect to some extent virtual machine state from real machine state for arbitrary periods (even if brief) as the VMs get scheduled, and this completely alters the timings of events inside VMs, and in a rather deterministic way, as schedulers tend to be so.

This and other aspects of virtual machines can starve the entropy pool of entropy or make pseudo random number generation much more predictable, thus weakening keys.

Some virtual machine platforms offer workarounds for this, but this is yet another reason why virtual machines are usually not a good idea.
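
On Linux the kernel reports its current entropy estimate in /proc/sys/kernel/random/entropy_avail, so a guest can at least be checked for starvation. A minimal sketch follows; the function name and the default threshold of 200 bits are arbitrary assumptions, not recommended values.

```shell
# Succeed (exit status 0) when the kernel entropy estimate is below a threshold.
# $1: file to read (defaults to the Linux entropy estimate)
# $2: threshold in bits (the default of 200 is an arbitrary assumption)
entropy_is_low() {
  file=${1:-/proc/sys/kernel/random/entropy_avail}
  threshold=${2:-200}
  avail=$(cat "$file") || return 2   # estimate not readable
  test "$avail" -lt "$threshold"
}
```

Running something like "if entropy_is_low; then echo 'entropy pool nearly empty'; fi" inside a guest gives a rough indication, though a healthy estimate does not by itself prove that the random numbers are unpredictable.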

120721b Sat: Impressions of the Dell P2311H monitor

The Dell P2311H monitor with a 23in display (510mm×287mm) belongs to the value range of Dell monitors and these are my impressions.

I got this because it was part of a package with a nice Dell Optiplex desktop. The display uses a TN panel and has a full resolution of 1920×1080 pixels. Things that I liked:

The stand and build quality are good, even if they are in plastic and not metal unlike the more expensive Dell monitors.
The picture is quite readable and the Dell VGA to digital converter builtin seems to work pretty well.
It has, like most Dell office monitors, a convenient built-in USB hub with a reasonable number of ports.
The price is reasonable.

The things I liked less:

Like most recent Dell monitors the power button is exactly at the bottom right corner, so grasping the monitor to adjust it physically will often result in accidentally switching it off.
The TN display has terrible viewing angles, especially vertically. Even without moving my head there is a large and obvious difference in color temperature and contrast between the top and the bottom of the screen, because the screen is tall enough that the viewing angle differs.
This means that windows almost as tall as the screen have this rather distracting gradient, and portrait mode is unusable. This is bad compared even to many other TN displays, never mind IPS displays.
Colors seem a bit unrealistic, but whether this is because of the viewing angles or not is not clear, and not very noticeable anyhow.

Overall I think that the similar model U2312HM is vastly preferable as the display is much better and the cost is not much higher. Even the smaller and cheaper IPS225 is much preferable.

120721 Sat: Impressions of the LG IPS225 monitor

I have recently been using an LG IPS225 monitor with a 21.5in display (477mm×268mm) and what I liked about it:

The IPS display works pretty well, delivering good colours and pretty good viewing angles (allowing portrait mode use), and fairly decent contrast.
It is light and the built-in stand is easily detachable and it has VESA mounting sockets so a better stand can be used.
The full HD resolution of 1920×1080 is convenient.
It has not just both DVI and VGA sockets, but also HDMI.
The delivery box is small and light, making the monitor easily transported.
The control menus are fairly well designed.
The autosync works fairly well.
The price is good.

The things I liked less:

The default stand is terrible, the usual foot-style stand of value monitors.
The IPS display shows blacks with a violet tint from wide angles, even if this is only noticeable when the screen saver is on and the whole display is black.
The IPS display is 18 bit native with dithering simulating 24 bit colors. It seems unnoticeable to me.

Overall I think that it is a good monitor, and for the price it is very good. The stand is terrible, but it is easy to find decent VESA mount stands, in particular those that allow rotating it into portrait mode.

Since it has a 21.5in diagonal its size is much more suitable for portrait operation than that of monitors with a 23in or 24in display, which tend to be too tall, and I think it works very well in portrait mode, which I think is usually preferable, even if it is a bit too narrow (just like it is a bit too short when in landscape mode) because of the usual skewed aspect ratios.

A smart person I know also bought this model and is also using it (only) in portrait mode, having chosen carefully.

In its class the LG IPS225 is amazing value.

120622 Fri: Data structures and control structures

Today a smart person spotted that I sometimes write for loops with a single repetition, and asked me why. There is more than one reason, and one is somewhat subtle, and it is in essence to write something similar to a with statement from Pascal and similar languages, which is used to prefix a block with the name of a datum and operate on it implicitly.

It is part of a hierarchy of control structures that is parallel to a hierarchy of data structures, as follows:

Control and data structures
Data structure	Control structure
constants	expression
variables	assignment
records	(`with`) block
arrays	`for`
lists	`while`
trees	recursion
acyclic graphs	iterators
graphs	closures

The above table (which is slightly incomplete) is in order of increasing data structure complexity, and the corresponding control structure is the one needed to sweep the data structure, that is to make full use of it. Programming consists largely of designing data structures and then various types of sweeps through them.

The block is the control structure appropriate for manipulating a record, and here are two examples in C and in Pascal:

#include <math.h>

struct complex { float re,im; };

float cmagn(const struct complex *const c)
{
  /* The inner block plays the role of a 'with' on the record *c. */
  {
     const float re = c->re, im = c->im;

     return (float) sqrt((re*re) + (im*im));
  }
}

TYPE complex = RECORD re, im: REAL; END;

FUNCTION cmagn(c : complex): REAL;
BEGIN
  WITH c DO
  BEGIN cmagn := sqrt((re*re) + (im*im)); END;
END;

The intent of the above is to make clear to both reader and compiler that the block is a program section specifically about a given entity c. Similarly sometimes I write in my shell scripts something like:

for VDB in '/proc/sys/vm/dirty_bytes'
do
  if test -e "$VDB" -a -w "$VDB"
  then
    echo 100000000 > "$VDB"
  fi
done

which is equivalent to but perhaps slightly clearer than:

{
  VDB='/proc/sys/vm/dirty_bytes'

  if test -e "$VDB" -a -w "$VDB"
  then
    echo 100000000 > "$VDB"
  fi
}

The version with for also allows me to comment out the value if I want to disable the setting of that variable.

But the real reason is to convey the notion that the block is about VDB and it is a bit more emphatically clear with the for than with the generic { block.

There are other cases where I slightly misuse existing constructs to compensate for the lack of the more direct ones, both of them after some ideas or practice by Edsger Dijkstra.

In the chapter he wrote for Structured Programming he introduced the idea of using goto labels as block titles, in outline:

extern void		  *CoreCopy(
  register void		  	*const to,
  register const void		*const from,
  register long unsigned		bytes
)
{
  copySmallBlock:
    if (bytes < ClusterBEST)
    {
      CoreBYTECOPY(to,from,bytes);
      return to;
    }

  copyHead:
    if (bytes >= ClusterDOALIGN)
    {
      long unsigned	    odd;

      if ((odd = ClusterREM((addressy) to)) != 0)
      {
        CoreODDCOPY(to,from,odd = ClusterBYTES - odd);
        bytes -= odd;
      }
    }

  copyClusters:
    CoreCLUSTERCOPY(to,from,ClusterDIV(bytes));

  copyTail:
    CoreODDCOPY(to,from,ClusterREM(bytes));

  return to;
}

I gave up on the practice of using labels as section titles because most compilers complain that such labels are unused in goto statements.

Edsger Dijkstra also introduces in A Discipline of Programming the notion that unhandled cases in if statements are meaningless, so that this program fragment is meaningless if n is negative:

if
  n >= 0: s = sqrt(n);
fi

Sometimes I use while and nontermination to indicate a similar effect, relying on the property of while that it is a precondition falsifier thus for example writing in a shell script something like:

for P in ....
do
  while test -e "$P"
  do rm -f "$P"
  done
done

120621 Thu: Some possible causes for SSL connections errors to Apache

Thanks to mod_ssl the Apache web server can support SSL connections, but since these involve encryption they can suffer from somewhat subtle issues.

These issues are usually related to the rather complicated X.509 certificates but sometimes they are caused by performance problems, as SSL connections require significant numbers of random bytes for encryption keys and then processing time to encrypt data.

There are in particular two cases in which random number generation can cause failed connections:

mod_ssl defaults when running over Linux to get random data from /dev/random, which is the kernel's entropy source fed from the intervals between internal kernel events. When events don't happen frequently this source can stall and therefore connection startup can take a long time.
The solution is to use the SSLRandomSeed directive to use a random generator with bounded latency, even one with lower quality randomness like /dev/urandom.
mod_ssl defaults to creating a new key for every new SSL connection, but sometimes web clients create a group of related connections very quickly, for example to GET all the objects on a page, and this can overload the session state system.
The solution is to use the SSLSessionCache shm: directive to tell mod_ssl to cache the keys and encryption states of related connections in a shared table.

120609 Sat: Very, very slow disk writes because of misconfigured parameters

In the past few days my laptop has shown signs of very, very slow writing, at less than 1MiB/s, while at the same time its flash SSD drive when tested with hdparm would read at nearly 300MiB/s. Doing some tests with dd showed that writing with oflag=direct would run at the usual write speed of just over 200MiB/s.

This clearly pointed at some issue with the Linux page cache and after some searching I found that the kernel parameters vm/dirty_... were all 0. Normal writing speed was restored by setting them to more appropriate values like:

vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 900000000
vm.dirty_ratio = 0
vm.dirty_bytes = 1000000000

But I had set up /etc/sysctl.conf with appropriate values, and anyhow eventually those parameters changed back to zero. After some investigation the cause turned out to be that the /usr/lib/pm-utils/power.d/laptop-mode script was being run by the pm-utils power management logic. Normal behaviour was restored by listing laptop-mode in the file /etc/pm/config.d/blacklist.

The problem with laptop-mode is that it sets the vm/dirty_... parameters in the wrong order: setting the parameters that end in _bytes zeroes the parameters with a similar name ending in _ratio, so the parameters ending in _ratio are supposed to be set first, and then those ending in _bytes if available.

The reason for that is that eventually some Linux kernel contributor realized the stupidity of basing the flushing on a percentage of the amount of memory available rather than on a fixed amount (usually related to IO speed), so provided a second set of settings for that, with the automagic side-effect of zeroing the overridden percentage settings, to avoid ambiguity.

Unfortunately if the order in which the settings are made is wrong it is possible to end up with zeroes in most of them, which causes the page cache code to behave badly.

I had written a much better change in which values over 100 in the old settings would be interpreted as a maximum number of dirty pages, and also fixed some ridiculous other automagic side effects.
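
The ordering described above can be sketched as a small shell function that writes the _ratio settings first and the _bytes settings last. This is only an illustration under assumptions: the function name and its arguments are made up, and the directory is a parameter (normally /proc/sys/vm) so that the function can be tried on an ordinary directory.

```shell
# Write the page cache flushing limits in the order described above:
# first the _ratio parameters, then the _bytes ones.
# $1: directory holding the parameters (normally /proc/sys/vm)
# $2: value for dirty_background_bytes
# $3: value for dirty_bytes
set_dirty_limits() {
  vm_dir=$1
  for setting in \
    "dirty_background_ratio=0" \
    "dirty_ratio=0" \
    "dirty_background_bytes=$2" \
    "dirty_bytes=$3"
  do
    file="$vm_dir/${setting%%=*}"   # parameter name before the '='
    value=${setting#*=}             # value after the '='
    # Older kernels lack the _bytes files, so skip what is not there.
    if test -e "$file" && test -w "$file"
    then echo "$value" > "$file"
    fi
  done
}
```

For example set_dirty_limits /proc/sys/vm 900000000 1000000000, run as root, would apply the values listed earlier.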

120607 Thu: Changing the mounting options of the "root" filetree

For historical reasons I have installed Ubuntu 12.04 on a laptop in an ext4 filetree, and I have decided that I would rather mount it in data=writeback mode with barrier=1 than the default data=ordered mode with barrier=0 (something that is even more important for the ext3 filesystem).

This has been frustrating because in most contemporary distributions the Linux kernel boots into an in-memory block device extracted from an initrd image, and the root filetree is mounted by a script in that filetree using hardcoded parameters at first and then parameters copied from /etc/fstab, and this can be rather inflexible as previously noted.

Therefore at first that filetree is mounted with data=ordered and then it is remounted with data=writeback, which fails because the data journaling mode cannot be changed on remount.

However unlike in other distributions the /init startup script in the Ubuntu 12 initrd scans the kernel boot line for various parameters as in:

....
# Parse command line options
for x in $(cat /proc/cmdline); do
        case $x in
        init=*) ....
        root=*) ....
        rootflags=*) ....
        rootfstype=*) ....
        rootdelay=*) ....
        resumedelay=*) ....
        loop=*) ....
        loopflags=*) ....
        loopfstype=*) ....
        cryptopts=*) ....
        nfsroot=*) ....
        netboot=*) ....
        ip=*) ....
        boot=*) ....
        ubi.mtd=*) ....
        resume=*) ....
        resume_offset=*) ....
        noresume) ....
        panic=*) ....
        quiet) ....
        ro) ....
        rw) ....
        debug) ....
        debug=*) ....
        break=*) ....
        break) ....
        blacklist=*) ....
        netconsole=*) ....
        BOOTIF=*) ....
        hwaddr=*) ....
        recovery) ....
        esac
done
....

Among them there is rootflags=.... which can be set. But to set the parameters for mounting the root filetree three steps may be required:

Ensure that the root, rootflags and rootfstype are set on the boot line.
Ensure that the same values appear in the /etc/fstab of the currently running system.
On some distributions update the initrd image as that copies into it the /etc/fstab options.
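
To check the first step, the same kind of scan that the /init script does can be repeated on the running system. A minimal sketch, in which the function name is made up and the boot line is passed as an argument so that it can be tried on any string:

```shell
# Print the value of rootflags= found in a kernel boot line, or an
# empty line if the parameter is absent.
# $1: the boot line, for example "$(cat /proc/cmdline)"
rootflags_from_cmdline() {
  flags=
  for word in $1          # deliberately unquoted: split on whitespace
  do
    case $word in
      rootflags=*) flags=${word#rootflags=} ;;
    esac
  done
  echo "$flags"
}
```

With the settings discussed here, rootflags_from_cmdline "$(cat /proc/cmdline)" would be expected to print data=writeback,barrier=1 after a reboot with the new boot line.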

120530 Wed: OS6 and BCPL, the seed for most PC operating systems

A little known but historically very important operating system was OS6 developed at the Oxford Computing Laboratory by Christopher Strachey and collaborators around the year 1970.

It is important because the Xerox Alto operating system was derived from it:

Alto OS (1973-76): I designed (and, with Gene McDaniel and Bob Sproull, implemented) a single-user operating system, based on Stoy and Strachey's OS6, for the Alto personal computer built at PARC [14, 15a, 22, 38, 38b]. The system is written in BCPL and was used in about two thousand installations.

The Alto was the first modern workstation, inspired by Alan Kay's goal of a Dynabook.

OS6 is also important because being written in BCPL it popularized that language in the USA. From BCPL Ken Thompson derived first B (which was used to implement the Thoth operating system that eventually evolved into the V operating system and also QNX), which then evolved into C. OS6 itself seems to have inspired several aspects of UNIX, as the OS6 paper on its stream based IO and filesystem shows.

120507 Mon: Continuing to appreciate a good monitor

I have been going around a bit with my older Toshiba U300 laptop and I have been using its built-in display, which like most laptop displays has poor color quality and narrow viewing angles, I hope because that reduces its power consumption.

Since then I have been back at my desk, where I use my laptop as my main system but with an external monitor, keyboard and mouse. This not only avoids wearing out the built-in ones, but the external ones tend to be more comfortable.

In particular the monitor: not only is the display of my Philips 240PW9 monitor much larger, but its viewing angles and in particular the quality of its colors are amazingly good as previously noted, something that I appreciate more after using the built-in display of the laptop for a while.

120506 Sun: Amazing price drop for 256GB flash SSD

While browsing to see the current state of disk prices, I noticed that the 256GB flash SSD I bought some months ago for around £290 now can be bought for £180.

That's a remarkably quick drop in price, probably also due to the restoration of some hard disk production after flooding in Thailand. I am very happy with my SSD, which I manage carefully as to endurance, not just because it is quite fast, especially in random access workloads such as metadata intensive ones, but also because I need to worry a lot less about bumping it while operating, as it does not have mechanical parts, never mind a high precision low tolerance disk assembly spinning 120 times per second.

120503 Thu: APM for desktop drives and load cycle count

I have just had some issues with the main disk of my minitower system, and I had a look at its state and while most of its status as reported by smartctl -A is good the load cycle count is very high:

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.1.0-3-grml-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       19503
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   074   056   025    Pre-fail  Always       -       7911
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       128
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9602
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   001   001   000    Old_age   Always       -       285014
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       114
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   059   000    Old_age   Always       -       33 (Min/Max 15/41)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       107
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       6548
223 Load_Retry_Count        0x0032   001   001   000    Old_age   Always       -       285014
225 Load_Cycle_Count        0x0032   072   072   000    Old_age   Always       -       285151

In the above report also note that the reallocated counts are zero, that is no permanently defective sector was found, and all the IO errors that I have seen have been transient. For comparison these are the relevant rows for the other 3 drives in that mini-tower:

# for N in b c d; do smartctl -A /dev/sd$N; done | egrep -i 'power_on|start_stop|load_cycle'
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1891
  9 Power_On_Hours          0x0012   097   097   000    Old_age   Always       -       25505
193 Load_Cycle_Count        0x0012   098   098   000    Old_age   Always       -       2906
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       444
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       10499
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       669
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1482
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       9798
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1486

From looking at the growth of the load cycle count over 30 minutes on the affected disk it looks like there is a load cycle every 2 minutes, or 30 times per hour, and indeed the total load cycle count is around 30 times the number of power on hours.
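As a quick check of that ratio, here is a minimal sketch that divides the two raw values from the report above. The function names are mine and purely illustrative.

```python
# Raw values copied from the smartctl report above (the affected drive).
POWER_ON_HOURS = 9602       # attribute 9, Power_On_Hours
LOAD_CYCLE_COUNT = 285151   # attribute 225, Load_Cycle_Count


def load_cycles_per_hour(load_cycles, power_on_hours):
    """Average number of head load cycles per powered-on hour."""
    return load_cycles / power_on_hours


def minutes_per_load_cycle(load_cycles, power_on_hours):
    """Average interval in minutes between two load cycles."""
    return 60.0 * power_on_hours / load_cycles


rate = load_cycles_per_hour(LOAD_CYCLE_COUNT, POWER_ON_HOURS)
interval = minutes_per_load_cycle(LOAD_CYCLE_COUNT, POWER_ON_HOURS)
print(f"{rate:.1f} load cycles per hour, one every {interval:.1f} minutes")
```

The same division on the other three drives gives well under one load cycle per hour, which is what makes the affected drive stand out.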

That is a well-known issue for laptop drives (also here and also here) but it seemed to affect only a few desktop drives, notably low-power ones like WD Green drives.

The drive affected in my case is a Samsung Spinpoint F3 1TB (HD103SJ) which is a performance oriented desktop drive. It must be replaced, as a load cycle count of almost 300,000 is very high for a desktop drive.

I have a script that sets the parameters I like for various machines and drives, and the relevant lines used to be:

  # Must be done in this order, '$DISKL' can be a subset of '$DISK[SA]'.
  case "$DISKS" in ?*)                  hdparm -qS 0            $DISKS;; esac
  case "$DISKA" in ?*)                  hdparm -qS 0            $DISKA;; esac
  case "$DISKL;$IDLE" in ?*';'?*)       hdparm -S "$IDLE"       $DISKL;; esac
  case "$DISKL;$APMD" in ?*';'?*)       hdparm -B "$APMD"       $DISKL;; esac

(where $DISKS is SCSI disks, $DISKA is PATA disks, and $DISKL is laptop-style disks). I had thought that setting the spindown timer to 0 would completely disable power management, but that obviously is not the case at least with some drives, so I have added a specific setting to disable APM by default:

  # Must be done in this order, '$DISKL' can be a subset of '$DISK[SA]'.
  case "$DISKS" in ?*)                  hdparm -qS 0 -qB 255    $DISKS;; esac
  case "$DISKA" in ?*)                  hdparm -qS 0 -qB 255    $DISKA;; esac
  case "$DISKL;$IDLE" in ?*';'?*)       hdparm -S "$IDLE"       $DISKL;; esac
  case "$DISKL;$APMD" in ?*';'?*)       hdparm -B "$APMD"       $DISKL;; esac

The previous default value for APM level was 254, which is the least aggressive setting that still enables APM, and on most drives that does not activate load cycles, but obviously on this drive it did.
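For reference, a small sketch of how the APM levels accepted by hdparm -B are grouped, going by the ranges described in the hdparm(8) manual page. The function name is made up, and what a given drive's firmware actually does at each level (for example whether it unloads the heads at 254) is not something the level alone can tell, as the drive above shows.

```python
def describe_apm_level(level):
    """Classify an 'hdparm -B' level using the ranges from hdparm(8).

    This only restates the documented ranges; actual drive behaviour
    at a given level is firmware specific.
    """
    if level == 255:
        return "APM disabled"
    if 1 <= level <= 127:
        return "APM enabled, spin-down permitted"
    if 128 <= level <= 254:
        return "APM enabled, spin-down not permitted"
    raise ValueError(f"not a valid APM level: {level}")


for level in (1, 127, 128, 254, 255):
    print(level, describe_apm_level(level))
```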

Yet another small but vital detail to check and a parameter to set when buying or installing a drive, like SCT ERC.

120429 Sun: The mail store problem

Recently I was asking about the boundaries of some support areas and I was relieved that these did not involve any user-visible mailstores, and that the e-mail services were done by some unfortunates with Microsoft Exchange.

When I expressed relief, explaining that there are usually very few mailstores that scale, especially those with mail folders in the style of MH or Maildir ones, the rather smart person I was discussing with surprised me by saying that they had no problems with a largish Dovecot based mailstore.

Which to me means that even smart persons can be quite optimistic as to the future, and that the mailstore issue is indeed often underestimated.

The mailstore issue is the issue of how to archive, either short or long term, e-mail messages, which are usually relatively small individually, and can accumulate in vast numbers. This issue has become larger and larger for three very different reasons:

The vast increase in e-mail traffic: Some widely reported statistics say that average office workers receive 200 messages per day, most of them purely for information.
The vast increase in e-mail message retention: By itself much increased rates of message traffic would not be an archival issue if those messages were deleted after being read. But many people retain most messages, just in case, and since e-mail has been popular for years, many people have years of retained mail messages (I have a few 30 year old messages).
Central mail stores because of IMAP: It used to be that most or many e-mail messages would be downloaded or copied by the MUA from the MDA mailstore (often using POP3) which means that archival of mail messages was part of the issue of personal filestore management, but currently many users configure their MUAs to keep all archived mail messages in the MDA's mailstore accessing them by IMAP.

The combined effect is that MDA mailstores often grow because of the volume of traffic, many users retaining messages for years, and keeping them in the central mailstore instead of their own filestore.

This growth is similar to that of filestores in general, but for mailstores usually things are worse because of the sheer volume of traffic, as few users manage to create or receive hundreds of files per day.

Now the mailstore issue is that the cost profile of mailstore operations is extremely anisotropic, because there are two entirely distinct modes:

Most accesses are to a few recent messages, and most messages are accessed only a few times soon after being received, and then neither read nor written ever afterwards. Most of these accesses are low frequency, because the users reading messages have a limited reading speed.
There are infrequent but regular accesses to the whole mailstore, or to whole folders, usually for backups or for indexing and searching of message contents, and these involve accesses to messages archived long ago and otherwise never accessed again.

Even smart persons seem to consider only the cost of low-frequency, user-driven access to recent messages, and conclude that there is no mailstore problem. But the mailstore problem is related to bulk programmatic access by system administrators, for backup or indexing, and by users for searching.

The difficulty with bulk accesses is that they impact collections of very numerous, rather small messages, which have been received at widely different times, and thus can require a very large number of random access operations.

The mailstore issue is very similar to the much wider issue of filetrees with many small files, with the added complications of the high rates of message arrival, and the far more common bulk searches.

The main reason why the mailstore issue has a large cost is the usual one that ordinary rotating disk storage devices with mechanical positioning are largely one-dimensional and have enormous access time anisotropy across their length, which means:

Random access to messages is extremely expensive, but as long as ordinary message access is low frequency user driven to the most recent messages it can be scaled by scaling the number of positioning devices.
It is difficult to minimize average inter-message distance in order to minimize the random accesses during bulk operations, because of the extreme interleaving of arrival of messages by user and folder or topic.

The second point is the biggest issue, because it makes it hard to put messages along the length of a disk so their physical locations correlate with some useful logical ordering, such as by user or by folder or by topic.

The most common storage representations for mailstores all have problems with this:

Messages are individual files in various directories

This can be realized in two different ways as to directories: directories represent logical partitions of the mailstore, for example by user and by folder, or directories represent physical partitions of the mailstore, for example by arrival time, and there are indices for locating all messages belonging to a user or a folder or a topic.

This is the MH or Maildir or old-type news-spool style layout.

Message collections are individual files in various directories

In other words messages are considered log entries, and they are logged by user, folder, or topic, with further classification by directories, and perhaps with in-band or out-of-band indices.

This is the classic mailbox style layout or new-type circular news-spool layout.

Messages are records in a database

This is a variant on putting a list of messages in a file, where they are stored inside the tables of some kind of database (usually relational).

It is then up to the DBA to choose a suitable physical layout, and as a rule it is similar indeed to a list of messages in a file.

Of these the most popular and at the same time the worst is the one file per message, because of two bad reasons, both related to updates:

It is very convenient for lazy or differently competent application programmers to use the filesystem as if it were a database manager, and use directories as sets of records stored as files. This means that the filesystem is burdened with the inappropriate task of handling well large collections of small records, and the application writer does not need to worry about updates, locking, deletion, searching.
Because of historical accident it is not easy to update parts of files in place on some filesystems (notably older versions of NFS), therefore it is tempting to split a collection of messages into individual files.

There are however two big problems with mail stores implemented as file-per-message:

Message collections are rather naturally organized as logs as they tend to arrive over time, and only the most recent and a few of the older ones are accessed with any frequency. This is a particularly damning point because MH style mailstores are particularly unsuitable for logs.
Filesystem designs have nearly always been aimed at small collections of large files mostly because of technological constraints, usually to avoid implementing filesystem metadata as a database capable of handling well large collections of small records. Even worse, storage technology improvements tend to improve sequential streaming rather than random access, making this point even stronger.

Then there is the question of why ever mailstores with one file per message are popular, and it is mostly about them being so very tempting despite these huge issues, and there are a couple of cases where these huge issues have a small cost or the advantages are more useful:

Small mailstores with one file per message don't seem to perform quite as badly as large ones, and most mailstores start small and then grow, therefore when a new mail system is installed the type of mailstore does not seem to matter.
In ancient times there was a sharp distinction between spooling and archival mailstores, that is those for new and transient messages as distinct from those for collections of past messages. For example incoming messages are delivered into the Inbox, and from there they are saved into topical or historical archives.

Unfortunately many if not most mailstores grow a lot, even the spooling ones, as many email users no longer move messages to archival message collections, but leave all messages in the inbox and rely on search and indexing instead.

Interestingly the mailstore issue happened several years earlier with NNTP servers, where newsspools, which used to be small and transient, became persistent and larger as news message volumes increased and most users stopped copying messages to archival files and relied on accessing newsgroup history and searching it. The same problems indeed occurred as this paper details:

Another problem one can face in maintaining news service is with the design and performance of most standard filesystems. When USENET first started out, or even when INN was first released, no one imagined that 9000+ articles/day would be arriving in a single newsgroup. Fortunately or unfortunately, this is the situation we now face. Most filesystems in use today, such as BSD's FFS (McKusick, et al., 1984) still use a linear lookup of files in directory structures. With potentially tens of thousands of files in a single directory to sort through, doubling the size of the spool (for example, by doubling the amount of time articles are retained before they are expired) will place considerably more than double the load on the server.

The solution has been switching to many messages per file, using a log style structure (circular as usually news messages expire after a limited time) with an internal structure that avoids using expensive filesystem metadata as an external structure:

A number of other programming efforts are under way to address filesystem limitations within INN itself. One of several proposals is to allocate articles as offset pointers within a single large file. Essentially, this replaces the relatively expensive directory operations with relatively inexpensive file seeks. One can combine this with a cyclic article allocation policy. Once the file containing articles is full, the article allocation policy would "wrap around" and start allocating space from the beginning of the storage file, keeping the spool a constant size.

The Cyclic News File System (CNFS)[5] stores articles in a few large files or in a raw block device, recycling their storage when reaching the end of the buffer. CNFS avoids most of the overhead of traditional FFS-like[11] file systems, because it reduces the need for many synchronous meta-data updates. CNFS reports an order of magnitude reduction in disk activity. CNFS is part of the INN 2.x Usenet server news software. In the long run, CNFS offers a superior solution to the performance problems of news servers than Usenetfs.
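To make the idea in the two quotes above concrete, here is a toy sketch of cyclic allocation inside one preallocated buffer. It is not how CNFS is implemented: the class name, the in-memory bytearray standing in for the large file, and the dictionary index are all assumptions made for illustration, and the linear scan used for expiry would not scale.

```python
class CyclicSpool:
    """Toy cyclic article store: one fixed buffer, offsets instead of files."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity)  # stands in for one large spool file
        self.head = 0                   # offset where the next article goes
        self.index = {}                 # article id -> (offset, length)

    def store(self, article_id, data):
        if len(data) > self.capacity:
            raise ValueError("article larger than the whole spool")
        if self.head + len(data) > self.capacity:
            self.head = 0               # wrap around to the beginning
        start, end = self.head, self.head + len(data)
        # Articles whose bytes are about to be overwritten expire implicitly.
        for old_id, (off, length) in list(self.index.items()):
            if off < end and start < off + length:
                del self.index[old_id]
        self.buf[start:end] = data
        self.index[article_id] = (start, len(data))
        self.head = end

    def fetch(self, article_id):
        entry = self.index.get(article_id)
        if entry is None:
            return None                 # expired or never stored
        off, length = entry
        return bytes(self.buf[off:off + length])


spool = CyclicSpool(16)
spool.store("a", b"aaaaaa")
spool.store("b", b"bbbbbb")
spool.store("c", b"cccccc")             # does not fit at the end, wraps over "a"
print(spool.fetch("a"), spool.fetch("b"), spool.fetch("c"))
```

Storing and expiring an article never touches filesystem metadata here: both are plain writes at an offset, which is the point the quotes make about replacing directory operations with seeks.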

Many years ago a large USENET site switched their newsspool to Network Appliance servers because their WAFL filesystem is itself log-structured and thus implicitly and to some extent turns message per file archives into something very similar on disk to log-structured files:

So, you've got 10,000 articles/day coming into misc.jobs.offered, in fact that's old, I don't know how many are coming in these days, maybe its 15,000 or so, I wouldn't be surprised if this were the case. And any time you want to look up attribute information on any one of them you need to scan linearly through the directory.
That's really bad. It's really much better to be able to scan for the file using a hashed directory lookup like you have in the XFS file system that SGI uses, the Veritas file system that you can buy for Sun, HP, and other platforms, or in our case, what we did was use the WAFL file system on the Network Appliance file servers.

My conclusion from this is that dense representations for mailstores are vastly preferable to message per file ones, except perhaps for small, transient mailspools (and indeed most MTAs use message per file mailspools). Since currently most email users keep their messages in their Inbox thanks to email access protocols like IMAP, and often don't sort them by hand into archives, the ideal structure of a mailstore, like that of newsspools, is that of files with internal structure containing many messages, as the internal structure is likely to be far more efficient, especially in terms of random accesses, than that of a directory with one file per message.

Which particular type of many messages per file structure to use is an interesting problem. The traditional mbox structure with in-band boundaries has some defects but it has the extremely valuable advantage of being entirely text-based and therefore easily searchable and indexable with generic text based tools, so it still seems to be the one to recommend as the default, perhaps with a hint to limit the size of any mbox archive to less than 1/2 second of typical sequential IO, for example around a few dozen MiB on 2012 desktops.
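A small sketch of both points, under stated assumptions: the size cap is just the sequential rate times the time budget (the 100MiB/s figure used below is my assumption for a desktop disk, not a measurement), and the message count is done by reading the mbox as plain text, looking only at the in-band "From " boundary lines. The demo mailbox is written with Python's standard mailbox module into a temporary directory; the addresses and subjects are made up.

```python
import mailbox
import os
import tempfile


def max_mbox_bytes(sequential_bytes_per_second, budget_seconds=0.5):
    """Largest mbox size that can be scanned within the time budget."""
    return int(sequential_bytes_per_second * budget_seconds)


def count_messages_as_text(path):
    """Count messages using only the in-band 'From ' boundary lines."""
    with open(path, "rb") as f:
        return sum(1 for line in f if line.startswith(b"From "))


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "archive.mbox")
        box = mailbox.mbox(path)
        box.add("From: alice@example.org\nSubject: disk quota\n\nFirst body\n")
        box.add("From: bob@example.org\nSubject: backups\n\nSecond body\n")
        box.flush()
        box.close()
        return count_messages_as_text(path)


print(max_mbox_bytes(100 * 2**20) // 2**20, "MiB cap at an assumed 100MiB/s")
print(demo(), "messages found by a plain text scan")
```

The text scan needs no mail-specific library at all, which is the advantage of the format; the same loop could just as well be a grep.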

An alternative used by GoogleMail is to store messages in log-structured files such as those supported by their GoogleFS, where large files containing many messages are accessed using parallel sequential scans.

The other alternative is to use a DBMS as they are specifically designed to handle many small data items. Unfortunately the most popular mail system that uses a DBMS to store messages is Microsoft Exchange and it has had a very large number of issues due to a number of mistakes in its implementation.

Several recent mailstore server implementations like Citadel (which uses BDB), or Apache's James, Open-Xchange Server, Zarafa, DBMail and some proprietary ones have DBMS backends.

120428 Sat: The continuing limitations of the Debian packaging tools

While the Debian™ project continues to be strong, and some of the more restrictive practices have been improved, from a technical point of view the Debian distribution continues to have several bad aspects as previously remarked and most of these have to do with rather objectionable choices in packaging policy and tools. As to the packaging tools, that is DPKG and associated .deb tools, I have been following the comical story of how they have been extended with enormous effort and controversy to allow the installation of two packages with the same name but for different architectures, just as the toolset was previously extended to handle checksum verification of package files.

Just as the extension for multiple architectures is rather incomplete, as it leaves out the very useful and important ability to have different versions of the same package installed, similarly limited is the ability to verify package file checksums, because as I was pointing out recently to someone I was discussing packaging issues with, the list of package file checksums is not cryptographically signed:

#  ls -ld whois.*
-rw-r--r-- 1 root root 172 2010-11-26 11:11 whois.list
-rw-r--r-- 1 root root 240 2010-03-20 05:09 whois.md5sums
#  cat whois.list 
/.
/usr
/usr/bin
/usr/bin/whois
/usr/share
/usr/share/doc
/usr/share/doc/whois
/usr/share/doc/whois/README
/usr/share/doc/whois/copyright
/usr/share/doc/whois/changelog.gz
#  cat whois.md5sums 
370b01593529c4852210b315e9f35313  usr/bin/whois
1835d726461deb9046263f29c8f31fe8  usr/share/doc/whois/README
971fdba71f4a262ae727031e9d631fa8  usr/share/doc/whois/copyright
78f366b8fb0bb926a2925714aa21fbe7  usr/share/doc/whois/changelog.gz

Which of course means that the checksum list cannot be used to verify the integrity of installed files except at installation time or for accidental damage. It is surely possible to use a separate integrity checker for deliberate modifications, but that only reveals changes in the file contents, not whether they are as they were when built by the distribution, which is a far more interesting notion.
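A minimal sketch of why an unsigned list only catches accidental damage. The checker below reads a file in the same two-column format as whois.md5sums above; the demo runs entirely in a temporary directory with made-up file contents, and the function names are mine. Whoever can modify the file can also rewrite the list, after which the check passes again.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def verify_md5sums(md5sums_path, root="/"):
    """Return the listed paths whose current MD5 differs from the list."""
    mismatches = []
    with open(md5sums_path, "r", encoding="utf-8") as listing:
        for line in listing:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, relative_path = line.split(None, 1)
            try:
                actual = md5_of(os.path.join(root, relative_path))
            except OSError:
                actual = None           # missing or unreadable file
            if actual != expected:
                mismatches.append(relative_path)
    return mismatches


def write_listing(listing_path, root, relative_path):
    digest = md5_of(os.path.join(root, relative_path))
    with open(listing_path, "w", encoding="utf-8") as f:
        f.write(f"{digest}  {relative_path}\n")


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "usr/bin"))
        binary = os.path.join(root, "usr/bin/whois")
        listing = os.path.join(root, "whois.md5sums")
        with open(binary, "wb") as f:
            f.write(b"as built by the distribution")
        write_listing(listing, root, "usr/bin/whois")
        clean = verify_md5sums(listing, root)
        with open(binary, "wb") as f:
            f.write(b"modified afterwards")
        detected = verify_md5sums(listing, root)
        # Nothing stops the modifier from regenerating the unsigned list.
        write_listing(listing, root, "usr/bin/whois")
        hidden = verify_md5sums(listing, root)
        return clean, detected, hidden


print(demo())
```

The last of the three results is empty again: with the list regenerated, the modified file verifies as if it were the original.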

Note: the signature situation is also quite dubious, as many packages are not signed at all, never mind the checksum files, and only repositories are signed, which is far from adequate.

Also while the distribution checksums can be verified using an easily retrieved public key, those produced by an integrity checker need either to be copied to a secure location, or be signed with a private key on the system on which the integrity checker runs, which is rather risky.

Anyhow package repositories and their package lists are part of the dependency management layer, based on APT in Debian, and unrelated to DPKG and the package management layer (to the point that the APT dependency management tools can also handle RPM packages), and the APT tools seem to me rather better (especially Aptitude) than the package management ones.

What is amazing is that despite these grave packaging tool issues Debian continues to be a successful distribution, in the same crude popularity sense that MS-Windows is successful because it is popular, despite (among many others) even worse package and dependency management issues (as MS-Windows barely has either).

120421b Sat: Two approaches to OLED, and new approach with LEDs

TIL about two different approaches to OLED display structuring: reportedly Samsung AMOLED displays use separate OLEDs for red, green and blue, each emitting the specific color, while LG uses white OLEDs with a color filter on top. I must admit that I had missed the existence of white OLEDs.

TIL also about the new display technology by Sony called Crystal LED, which is a display made of distinct red, green and blue LEDs.

The latter seems to me an evolution of LED backlights for LCD monitors. While most LED backlights involve only a few bright white LEDs providing illumination across the display, more sophisticated ones have a LED per pixel (which means little unevenness of illumination and allows a wider dynamic range of luminosity of each pixel). It probably occurred to Sony engineers that tripling the number of backlight LEDs would allow them to eliminate the LCD filter altogether, transforming the display from transmissive to emissive.

The subtext to these developments is that Taiwanese and Chinese manufacturers of ordinary LCDs have invested in a lot of production capacity for LCD displays, and Japanese and Korean companies are trying to push forward with technologies that their competitors have not yet invested in.

120421 Sat: First device with larger OLED display announced

Samsung have announced their Galaxy Tab 7.7 tablet and one of the main features is that it has a 7.7" AMOLED display with a 1280×800 pixel size, which gives it a pixel density of 200DPI.
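The density figure follows from the pixel dimensions and the diagonal: the diagonal length in pixels divided by the diagonal length in inches. A short sketch of that arithmetic, with a function name of my own choosing:

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixel density along the diagonal of a display."""
    return math.hypot(width_px, height_px) / diagonal_in


density = pixels_per_inch(1280, 800, 7.7)
print(f"{density:.0f} pixels per inch")
```

This comes out a little under the quoted 200DPI, which is evidently a rounded figure.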

This is the first OLED display with a diagonal larger than 3in for a mass market device, and the quick impression of the reporter from Computer Shopper UK is that it is a display of exceptional quality, both because of the fairly high DPI, and its luminosity and wide viewing angle.

As I am not particularly keen on tablets as I write a fair bit and even a laptop keyboard is much better than an on-screen keyboard, this display to me has the import that 1280×800 are pixel dimensions typical of laptops (at least before the latest rounds of increasingly skewed aspect ratios) which means that AMOLED displays may be coming to laptops, and eventually to standalone monitors.

120414 Sat: RMW description and cases where it happens

I had seen in a recent discussion thread a misunderstanding of advising filesystems of the alignment properties of the underlying storage device, in particular stripe size which really means alignment more than size.

Advising a filesystem of stripe size is really about advising it that the underlying storage medium has two different addressing granularities and some special properties:

Transfers can be parallelized at some multiple of the filesystem sector size.
Reading is possible at addresses multiple of the filesystem sector size.
Writing is only possible at addresses multiple of a different and (as a rule) bigger write-block, which can be a logical or physical write sector size.
The storage device pretends that writing is possible at the same addresses as reading, and transparently handles the mismatch between reading and writing granularities at great cost.

These things happen usually with devices where the physical sector size is larger than the logical sector size expected, as a matter of tradition, by most software using the device, including filesystems. In rare cases, such as flash memory, the physics of the recording medium have intrinsically different read and write granularities.

In extreme yet common cases the effective or implied physical sector size for writing can be some orders of magnitude larger than the one for reading, for example on some flash memory systems the sector size for reading is 4KiB but the sector size for overwriting is 1024KiB; the cost of simulating writing at smaller address granules can be an order of magnitude or two in sequential transfer rate.

A hint to the filesystem about the stripe size is a hint that allocation of space on and writing to the device should preferentially be done in multiples of that number and not of the sector or block sizes; put another way that the read and write sector sizes are different or should be different, because while it is possible to write sectors with the same size, a different (usually larger) sector size for writes is far more efficient.

If the filesystem disregards this, it is possible that two bad or very bad things may happen on writing:

The physical device itself reads the physical write sector to be updated, updates the read copy, and then writes back the whole physical write sector even if only partially modified. In extreme cases two physical write sectors have to be involved if the write straddles a boundary between two.
This may happen several times as successive logical writes require modification of the same physical write sector. Caching of physical write sectors is not always entirely effective, and while physical sector writes can be delayed, this can have dire effects on transaction latency and durability.
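The cost of the two cases above can be sketched with a little arithmetic. The function below is an illustrative model, not a description of any particular device: given a logical write and the size of the physical write sector, it counts how many physical write sectors are touched, how many bytes have to be read first (only partially covered sectors need a read), and how many bytes are actually written back.

```python
KIB = 1024


def rmw_cost(offset, length, write_block):
    """Return (blocks_touched, bytes_read, bytes_written) for one write.

    A physical write sector that is only partially covered by the write
    must be read before being rewritten; a fully covered one need not be.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = offset + length
    first = offset // write_block
    last = (end - 1) // write_block
    partial = set()
    if offset % write_block:
        partial.add(first)              # write starts inside a block
    if end % write_block:
        partial.add(last)               # write ends inside a block
    blocks = last - first + 1
    return blocks, len(partial) * write_block, blocks * write_block


# 4KiB logical write on a device with 1024KiB physical write sectors.
print(rmw_cost(0, 4 * KIB, 1024 * KIB))
# The same write straddling a boundary between two write sectors.
print(rmw_cost(1024 * KIB - 2 * KIB, 4 * KIB, 1024 * KIB))
# An aligned write of two whole write sectors: nothing needs reading.
print(rmw_cost(0, 2048 * KIB, 1024 * KIB))
```

In the first case 4KiB of payload costs 1024KiB read plus 1024KiB written, and the straddling case doubles both, while the aligned write of whole write sectors involves no read at all.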

The most common devices where as a rule writes should be done with a different sector size from reads are:

System memory: because of interleaving it often has a minimum transaction size for both reads and writes; and as a rule accesses to system memory are mediated by the cache and occur only in line sized transactions anyhow.
Rotating disks: recent devices with a physical sector size of 4096B that however declare a logical sector size of 512B.
DVD-R[W]: the physical sector size is 32KiB, but the logical sector size declared by the device is 2KiB for compatibility with CD-R[W].
Flash memory: while the profile is a bit more complex than this, as a rule flash memory can be read in 4KiB or 8KiB flash pages but only written in flash blocks of 256KiB to 1MiB.
Parity RAID with a stripe size larger than the sector size of the member devices: in order to recalculate parity on writes, the whole stripe needs to be read and updated, the parities recomputed, and the whole stripe plus parities written back. In the case of single parity it is fairly common to use a type of parity where this can be abbreviated to reading just the old data and the old parity, updating them, and writing the new data and parity to storage.
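The single parity shortcut mentioned in the last item relies on XOR being its own inverse: the new parity is the old parity XOR the old data XOR the new data. A minimal sketch with made-up two-byte chunks, checking that the shortcut gives the same parity as recomputing over the whole stripe (the function names are mine):

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(chunks):
    """Parity computed over every data chunk of the stripe."""
    return reduce(xor_bytes, chunks)


def updated_parity(old_parity, old_data, new_data):
    """Parity after replacing one chunk, without reading the others."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = full_parity(stripe)
new_chunk = b"\xaa\x55"

shortcut = updated_parity(parity, stripe[1], new_chunk)
recomputed = full_parity([stripe[0], new_chunk, stripe[2]])
print(shortcut == recomputed)
```

The shortcut still costs two reads and two writes for one logical write, so it reduces the RMW penalty rather than removing it.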

In all the cases above the insidious problem is that there is a firmware or software layer that attempts to mask the difference in sector sizes between read and write by automatically performing RMW, with often disastrous performance.

There are also a few cases where no RMW is necessary, yet performance is affected if the write sector size used is the one declared by the device, for example:

Rotating disks: writing (and reading) in track-sized sectors is as a rule far faster than multiple sector writes of the same amount of data. Most rotating storage devices have firmware that reads ahead whole tracks, or buffers writes, to enable writing whole tracks.
Memory: RAM chips have an internal structure that also makes it rather more efficient in throughput terms to write in larger units, even if the smaller units are well aligned.

120317 Sat: Mac OSX ignores 'fsync' on local drives

The subject of when exactly applications and various kernel layers commit data to persistent storage is a complicated and difficult one, and for POSIX style systems it revolves around the speed and safety of fsync(2), an (underused) system call of great importance and critical to reliable applications.

Therefore I was astonished when in a decent presentation about fsync and related issues, Eat My Data: How Everybody Gets File IO Wrong, on slide 118, I found the news that on Mac OSX requests to commit to persistent storage are ignored for local drives:

/* Apple has disabled fsync() for internal disk drives in OS X. That
   caused corruption for a user when he tested a power outage. Let us in
   OS X use a nonstandard flush method recommended by an Apple
   engineer. */

Note: On Mac OSX fsync has the narrow role of flushing only the system memory block cache (which is the most conservative interpretation of the POSIX definition of fsync), and does not flush the drive cache too, for which the platform specific F_FULLFSYNC option of fcntl(2) is provided.

That is quite unsafe. While fsync is usually a very expensive operation, and Mac OSX platforms are mostly laptops (MacBooks) and so usually have battery backup, just nullifying the effect of fsync is a very brave decision.

This news might explain why on a Mac OSX laptop of a friend /usr/sbin/cups at some point vanished, and there were several other missing bits: it is entirely possible that during a system update the laptop crashed either because of an error, or because the battery ran out of charge, or something else.
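For an application that wants its data committed on both kinds of platform, the F_FULLFSYNC option mentioned in the note above can be tried first, with plain fsync as the fallback. This is a sketch under assumptions: Python's fcntl module exposes F_FULLFSYNC only where the platform defines it, so the code looks it up with getattr, and the helper names are my own.

```python
import fcntl
import os
import tempfile


def flush_to_storage(fd):
    """Ask for data to reach persistent storage; report which call was used."""
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # only defined on Mac OSX
    if full_sync is not None:
        try:
            fcntl.fcntl(fd, full_sync)
            return "F_FULLFSYNC"
        except OSError:
            pass                        # not supported on this file, fall back
    os.fsync(fd)
    return "fsync"


def write_durably(path, data):
    """Write all of 'data' to 'path' and flush it before closing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:            # os.write may write only a part
            written = os.write(fd, view)
            view = view[written:]
        return flush_to_storage(fd)
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "important.dat")
        method = write_durably(path, b"must survive a power cut\n")
        with open(path, "rb") as f:
            return method, f.read()


print(demo()[0])
```

A complete version would also flush the containing directory after creating a new file; that is left out here to keep the sketch short.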

120310 Sun: Synchronicity and chunk size

One of the apparent mysteries of RAID is the impact of chunk size, which is for a stripe the number of consecutive logical sectors in the stripe on a single member of the RAID set.

In theory there is no need of a chunk size other than 1 logical sector, at least for RAID setups where the logical sector is not smaller than the physical one, however in practice the default chunk size of most RAID implementations is much larger than a single logical sector, often 64KiB even if the logical sector is 512B.

Obviously this is meant to improve at least some types of performance, and it does, even if large stripe sizes are a bad idea, especially for RAID6; but to understand that one needs to look at the synchronicity of RAID set members when they have non-uniform access times.

The problem is that when reading a whole stripe if the devices are not synchronized the current position of those devices may be different, and therefore reading all the chunks that make up a stripe involves different positioning latencies for every chunk, making the time needed to gather all chunks much longer than that needed for reading each chunk.

Making each chunk larger than a single logical sector spreads the cost of the per-chunk positioning latency over the size of the chunk, increasing throughput. Therefore having chunks of 64KiB instead of 512B means that the cost of the spread of latencies has to be incurred 128 times less often.

This however only really applies to streaming sequential transfers, and mostly only to reading, because when writing sectors are typically subject to write-behind and therefore can be scheduled by the block scheduler in an optimal order after the fact.

There is something of a myth that suggests the opposite, that small chunk sizes are best for sequential transfers as they maximize bandwidth by spreading the transfer across as many drives as possible, while large chunk sizes are best for random transfers as they maximize IOPS by increasing the chance that each single transfer only uses one device leaving the others for other transfers at other positions.

It is a bit of a myth because if the sequential transfers are streaming (that is long) all drives are likely to be used in parallel as the reads get queued, unless the read rate is lower than what the RAID set can deliver, in which case latency will be impacted. Put another way, a small chunk size will improve sequential transfers done in a size much larger than the chunk size, but most sequential transfers are streaming.

For random transfers the myth is more credible, because 4 transfers in parallel each reading 8 blocks serially takes less time than 2 transfers in parallel each reading 4 blocks in parallel over 2 drives. The reason is that parallelizing seeks is far more important than parallelizing block transfers, if the latter are not that large, because seeks take a lot longer.

As to that, a typical contemporary rotating storage device may have an average access time of 10ms, and be able to transfer 120MB/s, that is 1.2MB in 10ms. So 4 seeks parallelized 4-way cost 10ms, plus the data transfer time, while parallelizing 2-way 4 seeks costs 20ms plus half of the data transfer time, and the latter is better only if data transfers take 20ms or more, that is they are 2.4MB or more. The problem of course is that random transfers may happen to fall on positions within the same chunk, which serializes seeks, but then if they are within a chunk they should be short seeks and fast.
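The arithmetic above can be checked with a small sketch. It uses the 10ms and 120MB/s figures from the text; the function names are mine, and the cost model (one full access time per round of seeks, transfer time scaling linearly with size) is a deliberate simplification, not a measurement.

```python
# Figures taken from the text; the cost model itself is a simplification.
ACCESS_TIME_MS = 10.0      # average access time of one device
TRANSFER_MB_PER_S = 120.0  # sustained transfer rate of one device


def transfer_ms(megabytes):
    """Time in ms to move `megabytes` through a single device."""
    return megabytes / TRANSFER_MB_PER_S * 1000.0


def seek_parallel_ms(megabytes):
    """4 transfers on 4 devices at once: one round of seeks,
    each transfer read serially from its own device."""
    return ACCESS_TIME_MS + transfer_ms(megabytes)


def transfer_parallel_ms(megabytes):
    """4 transfers done 2 at a time, each split over 2 devices:
    two rounds of seeks, but each transfer takes half the time."""
    return 2 * ACCESS_TIME_MS + transfer_ms(megabytes) / 2


def break_even_mb():
    """Transfer size at which both arrangements take the same time."""
    # A + t = 2A + t/2  implies  t = 2A, that is 20ms with the figures above.
    break_even_ms = 2 * ACCESS_TIME_MS
    return break_even_ms / 1000.0 * TRANSFER_MB_PER_S


if __name__ == "__main__":
    for size_mb in (0.064, 1.2, 2.4, 4.8):
        print(size_mb, seek_parallel_ms(size_mb), transfer_parallel_ms(size_mb))
    print("break-even size in MB:", break_even_mb())
```

With these assumptions parallelizing the seeks wins for any transfer smaller than the break-even size, which is far larger than typical random transfers.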

However in general I prefer smallish chunk sizes, because the major advantage of RAID is parallelism especially for sequential transfers, and the major disadvantage is large stripe sizes in parity RAID setups.

Finally, some filesystems have layout optimizations that are based on the chunk size: metadata is laid out on chunk boundaries to maximize the parallelism in metadata access, and this is quite independent of stripe alignment for avoiding RMW in parity RAID setups. One example is ext4 with the -E stride=blocks parameter to mke2fs:

This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator.

The chunk size is also recognized by XFS, with the sunit=sectors parameter to mkfs.xfs:

This suboption ensures that data allocations will be stripe unit aligned when the current end of file is being extended and the file size is larger than 512KiB. Also inode allocations and the internal log will be stripe unit aligned.
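Since the two tools want the same chunk size in different units (filesystem blocks for stride, 512-byte sectors for sunit), a small conversion sketch may help. The function names and the 4096-byte default block size are my assumptions for illustration; only the units come from the descriptions quoted above.

```python
SECTOR_BYTES = 512  # unit of the sunit= suboption of mkfs.xfs


def ext4_stride(chunk_bytes, fs_block_bytes=4096):
    """Chunk size in filesystem blocks, as wanted by -E stride=.

    The 4096-byte default block size is an assumption; use the
    block size the filesystem is actually created with.
    """
    if chunk_bytes % fs_block_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    return chunk_bytes // fs_block_bytes


def xfs_sunit(chunk_bytes):
    """Chunk size in 512-byte sectors, as wanted by sunit=."""
    if chunk_bytes % SECTOR_BYTES:
        raise ValueError("chunk size must be a multiple of 512 bytes")
    return chunk_bytes // SECTOR_BYTES


if __name__ == "__main__":
    chunk = 64 * 1024  # a 64KiB chunk
    print("stride =", ext4_stride(chunk))
    print("sunit  =", xfs_sunit(chunk))
```

For a 64KiB chunk and 4KiB blocks this gives a stride of 16 blocks and a sunit of 128 sectors.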

120305 Mon: The power of narrow routes, including replacing VLANs

Reconsidering my previously suggested resilient network setup, which was based on taking advantage of routing flexibility thanks to OSPF, and the 6to4 setup under Linux, it turns out that they both use, in very different ways, unicasting (or anycasting).

Now that is in general a really powerful technique that is quite underestimated, mostly for historical reasons, as most people tend to think in terms of subnetting and route aggregation via prefixes.

The historical reasons are somewhat contradictory: the original ARPAnet was designed as a mesh network, using point-to-point links, where routes were to individual hosts, and there weren't many nodes (when I started using it as a kid there were 30 nodes).

Then the original IP specification had fixed-size address ranges:

An address begins with a network number, followed by local address (called the "rest" field). There are three formats or classes of internet addresses: in class a, the high order bit is zero, the next 7 bits are the network, and the last 24 bits are the local address; in class b, the high order two bits are one-zero, the next 14 bits are the network and the last 16 bits are the local address; in class c, the high order three bits are one-one-zero, the next 21 bits are the network and the last 8 bits are the local address.
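The three formats quoted above can be written down as a small sketch that inspects the high order bits of an address. The function name and the return shape are mine, chosen only for illustration; the bit patterns and field widths are those in the quote.

```python
import ipaddress


def classful_split(address):
    """Return (class, network bits, local bits) for a dotted-quad address,
    following the three formats in the quoted specification."""
    value = int(ipaddress.IPv4Address(address))
    if value >> 31 == 0b0:      # high order bit is zero
        return ("a", 7, 24)
    if value >> 30 == 0b10:     # high order two bits are one-zero
        return ("b", 14, 16)
    if value >> 29 == 0b110:    # high order three bits are one-one-zero
        return ("c", 21, 8)
    return (None, None, None)   # not one of the three quoted formats


if __name__ == "__main__":
    for example in ("10.1.2.3", "172.16.0.1", "192.0.2.1", "224.0.0.1"):
        print(example, classful_split(example))
```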

In particular there can be very many 256-address ranges, that is 2²¹ or around two million. This rapidly caused concern as various organizations started connecting newly-available Ethernet LANs to the Internet, and requesting distinct 256-address ranges for each:

An organization that has been forced to use more than one LAN has three choices for assigning Internet addresses:

Acquire a distinct Internet network number for each cable.

The first, although not requiring any new or modified protocols, does result in an explosion in the size of Internet routing tables. Information about the internal details of local connectivity is propagated everywhere, although it is of little or no use outside the local organization. Especially as some current gateway implementations do not have much space for routing tables, it would be nice to avoid this problem.

The concern arose because the original ARPAnet routers and their successors were by today's standards very slow computers, with very limited memory capacity (16KiB) and this resulted in the definition of subnetting (and supernetting) on arbitrary boundaries.

However a lot of people did allocate many portable class C ranges, even if in some cases as subnets of class B and class A ranges, and this led to large increases in routing table size. That in turn led to a stricter policy of assigning non-portable ranges to organizations to minimize the size of the global routing tables, and this was made very strict with IPv6, where routing is (nearly completely) strictly hierarchical instead of mesh-like.

But because of the intermediate period with many disjoint class C ranges, and the advent of routers based on VLSI processors and memory, even relatively low level routers today support quite large routing tables. For example the Avaya ERS5000 series support up to 4,000 routes (which is not much less than the up to 16,000 Ethernet addresses), and looking at a lower end brand like D-Link their most basic router is the DGS-3612G and it can support up to 12,000 routes (and 16,000 Ethernet addresses). Core routers, those that can be used as border gateways, can support hundreds of thousands to millions of routes. Linux™ based routers are reported able to handle at least as many.

Given this, it seems possible for many sites to just stop using subnet routing, and assign to each node a unique host (/32, 255.255.255.255) route.

This may seem somewhat excessive, and indeed it is, but it is feasible, and may be quite desirable, if not for all nodes, often for server nodes. The reason is that the unique, routed IP address assigned to the server is effectively location independent, and this gives at least two useful advantages:

The server can be relocated from one physical link to another without changing its IP address.
The server can be connected to multiple physical links, and its unique IP (usually on a dummy interface) will be reachable from all of them, and if ECMP routing is in operation, in a load-balanced way.

For several reasons (including debugging) often the better option would be to give servers both a unique routable IP address, and a subnet address for each link interface it is on.

There are some limitations to using unique routable addresses:

The loss of ability to use IP level subnet broadcast addresses, and I think that in many cases this is unimportant, or even desirable.
Traffic between two nodes on the same link would have to be via a router, as the two nodes would be on different IP subnets, and therefore various types of neighbour discovery would not work, but this can be worked around in a number of fairly reasonable ways (for example link specific direct routes).

It is useful to note that single (or otherwise narrow) IP routes achieve at a higher protocol level the same effect as VLANs when the latter are used to achieve subnet location independence.

Note: that is subnet location independence with one IP subnet per broadcast domain, where VLAN tagging of Ethernet frames is used to create multiple broadcast domains over a bridged network with more than one switch.

The location independence is achieved by having in effect individual link level routes to each Ethernet address in the bridged network, and that's why the switches mentioned above have Ethernet forwarding tables capable of holding as many as 16,000 Ethernet addresses, as every Ethernet address in the whole infrastructure must be forwardable from every switch.

VLAN tags and the notorious STP then reduce the costs by partitioning the network in effect into link-level subnets of which the VLAN tag is the network number.

The same effect, minus the risk of broadcast storms and other very undesirable properties of a bridged network, can be achieved by using routed IP addresses where IP addresses belonging to the same subnet have individual routes. Or perhaps where most IP addresses have the same route as most of those nodes are on the same link, and scattered nodes are routed to by more specific routes.
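The second arrangement relies on routers preferring the most specific matching route. A minimal sketch of that lookup follows; the table contents, the link names and the function name are hypothetical, and a real router uses far more efficient data structures than a linear scan.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the most specific route matching
    `destination`, or None if no route matches."""
    address = ipaddress.ip_address(destination)
    matching = [
        (network, next_hop)
        for network, next_hop in routes
        if address in network
    ]
    if not matching:
        return None
    # Longest prefix wins: a /32 host route beats the covering /24.
    _, next_hop = max(matching, key=lambda item: item[0].prefixlen)
    return next_hop


# Hypothetical table: one subnet route for the nodes that share a link,
# plus a host route for a node of the same subnet that sits elsewhere.
ROUTES = [
    (ipaddress.ip_network("192.0.2.0/24"), "link-a"),
    (ipaddress.ip_network("192.0.2.77/32"), "link-b"),
]

if __name__ == "__main__":
    print(lookup(ROUTES, "192.0.2.10"))    # covered only by the subnet route
    print(lookup(ROUTES, "192.0.2.77"))    # covered by the host route too
    print(lookup(ROUTES, "198.51.100.1"))  # no matching route
```

The scattered node keeps an address from the shared subnet, yet traffic for it follows the host route to the other link.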

There are good arguments why IP addressing based on functional (for example workgroup) relatedness is rather less preferable than geographical (sharing of a link) relatedness, but if it is needed, it is better to bring it up to the highest protocol layer possible. Indeed my preference is for IP addresses to be strictly based on geographical relatedness (one link, one subnet), and to offer functional relatedness via the DNS with suitable naming structures.

Note: individually routable IP addresses are in effect node or even service identifiers, in effect names brought down from the application naming layer to the transport layer.

However in some important cases functional relatedness is best handled at the IP level (because of applications that don't use DNS or that resolve names only when they start), and then very specific routes, thanks to routers that can handle hundreds or thousands of them, can be quite useful.