Software and hardware annotations 2007 May
This document contains only my personal opinions and calls of
judgement, and where any comment is made as to the quality of
anybody's work, the comment is an opinion, in my judgement.
- 070528
A poor UNIX legacy: dotfiles in the home directory
- I feel that in recent and not so recent implementations of the
UNIX design, like Linux, many of the aspects that are simple,
work well and are worthwhile are those that have not been changed
from UNIX, while those that are complicated and work poorly have
been introduced by less gifted designers, often transparently
inspired by the Microsoft cultural hegemony.
There is however at least one aspect of the UNIX tradition that
I feel is quite annoying, and it is dotfiles,
and in particular user configuration files as dotfiles in home
directories.
The convention not to list files whose name begins with a
dot is entirely arbitrary, in particular because it is
implemented in user code. When just about the only program that
would list directories was ls
it might have been
tenable, but it was a wrong idea anyhow. The rationale was to
avoid listing the .
and ..
entries,
and was generalized to any files beginning with a dot. But the
more appropriate design would have been to decide that when
listing a directory the two special entries would never
be listed, even if present.
The other regrettable convention was to put configuration
files in the user home directories, which was facilitated by the
convention that by default dotfiles are not listed. Now I have
several hundred dotfile configuration files and directories, and
this has a couple of problems, the major one being that they are
easy to forget, precisely because they are not listed by
default. The minor one is that in many file systems directories
with hundreds of entries are slow, and a user's home directory
must be traversed in very many cases. The more appropriate
design would have been to reproduce the classic UNIX style
hierarchy inside a user home directory, and have a convention to
put all user configuration inside an etc
subdirectory
of the user's home directory, and not dotfiles. Too late now.
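As a quick illustration of the scale of the problem, a one-liner
with ordinary shell tools (just a sketch, nothing standard) counts
the hidden entries that have accumulated in a home directory:
# Count the dotfile and dot-directory entries in a home directory;
# 'ls -A' lists hidden entries but omits '.' and '..' themselves.
ls -A "$HOME" | grep -c '^\.'
On a long-lived account this easily reports the several hundred
entries mentioned above, exactly the clutter that an etc
subdirectory convention would have kept in one place.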
- 070527f
Thinking again about simplicity: laptop versus desktop
- Thinking again about
simplicity
the issue is often which simplicity to privilege. An example
comes from some discussion about what to buy, a laptop or a
desktop:
- laptop
- Well, a laptop is a self-contained unit
that is very simple to use. It has all you need, no cables
hanging around, to repair it you just send it to the
manufacturer, and it can be carried around in a bag. It is in
effect an appliance.
- desktop
- A desktop is simplicity itself as to
upgrading and repairing: you open it, plug in new parts, and
there is no need to send it to the manufacturer for repair, as
most breakages can be fixed quickly by replacing a component.
Then it all depends on circumstances: a business may well decide
they don't need expandability, and standardize on laptops and
keep a stock of spare laptops to obviate the repair delay; I
have a desktop because for me it is much simpler to upgrade and
repair it myself. The same for the eternal debate between
bridging and routing:
bridging makes some forms of configuration simpler, routing
makes fault diagnosis and isolation simpler (but I can't imagine
any way in which RAID5 makes things simpler :-)
).
- 070527e
The problems with Wiki documents
- When writing documents, in particular technical documentation, I
prefer to write them in HTML (using a validating, DTD-driven
editor like
XEmacs
with PSGML
or jEdit
with Xerces-J).
I have also been contributing occasionally to some Wikis, and I
much prefer using HTML. The one advantage of a Wiki is
shared editing, but this is paid for at a high cost:
- Wiki text can only be properly read by installing a web
server and a Wiki engine, while HTML can be read with any
browser.
- There is no easy way to check Wiki text syntax, and it
lacks some of the useful details of HTML.
- The most common forms of Wiki text have a really annoying
syntax, and
the many different Wiki engines
often accept
rather different markup conventions
(for example
1,
2,
3,
4,
5).
HTML has a more verbose but equally simple syntax, and one
which is well standardized, so there are very many tools
that can be used to process it.
- A directory with a number of HTML files is easy to
grep or process, a set of Wiki
texts not so easy (unless the Wiki is based on text files
instead of a DBMS backend).
I reckon that HTML is simple enough to be used directly, and it
has none of the issues above. Other people think
that Wiki syntax is less intimidating, and that's why it is
popular; but I think that Wikis are popular not because of a
simpler syntax (it is not simpler to me) but because they avoid
the need to set up and maintain a web server, as the Wiki owner
does that.
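As a small illustration of a couple of the points in the list
above, these are the kinds of checks that are trivial with a
directory of HTML files and much harder with Wiki text held in a
DBMS backend (the paths are made up):
# Full-text search across a documentation tree with ordinary tools:
grep -l 'RAID5' ~/doc/*.html
# Check one page for markup errors (xmllint reports parse problems):
xmllint --html --noout ~/doc/raid.html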
- 070527d
Selling advanced switching features and split trunking
- Recently I had a meeting with a salesperson from a major
reseller of network equipment, and the pitch was about advanced
features in the routing switches they sell, and the attendant
consulting services. This was a bit inappropriate because the
network in which these are used is rather functionally simple
(just an
internet
of 10-30 laboratories) but
has huge performance requirements.
Anyhow the major pitch was for a very advanced feature
called
split trunking
which promises an amazing advantage: to have two outgoing
routers with transparent load balancing and failover; somewhat
like a
cluster system
(such as Heartbeat,
RHCS)
but for the gateway service
instead of a
higher level service like HTTP. If that does not sound amazing,
note that in general a leaf node in IP uses only one router
address per route, e.g. 10.0.0.1
as the default outgoing route for nodes in
10.0.0.0/8. There are special
case exceptions, for example Linux has the
advanced routing
extensions, but in general
IP
is based on one active gateway per route. Transparent failover
of that one route can be done, but only by turning each node
into a router that listens for route updates, or pointing each
node to a local router that does.
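To make the contrast concrete, this is roughly what the two cases
look like on a Linux leaf node with the iproute2 tools (a sketch
only: the addresses are made up, and the multipath form needs the
advanced routing extensions mentioned above):
# The ordinary case: one gateway address for the default route.
ip route add default via 10.0.0.1
# With the advanced routing extensions a route can list several
# nexthops, which is as close as a plain node gets to gateway
# load balancing without listening for route updates:
ip route add default \
    nexthop via 10.0.0.1 weight 1 \
    nexthop via 10.0.0.2 weight 1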
But the split trunking provided by this product allows one to
have a fully bridged internal
LAN (with VLANs), and then to
have a couple of routers to the outside that provide transparent
load balancing and failover even to nodes that have no load
balancing routing extensions and do not listen for route
updates. This sounds absolutely fantastic: anybody with more
money than sense can look good by purchasing such a solution,
which automagically works without requiring any complications
like routing or advanced routing setup. Salesman's routing!
But to others it may sound impossible, and indeed it is if
it is attempted within the limits of conventional IP technology,
because it is based on the two routers having the same IP
address, and the same
Ethernet
address too for good measure. The whole trick is based on the
idea that any node sending its packets to the two gateways will
be attached to a switch with a learning table (Ethernet address
to port), that the same Ethernet address can be in two entries of
the table, and that the switch can forward Ethernet frames across
both entries with the same address. Then when one of the routers
with the same Ethernet
MAC address
fails, the entries pointing to it are unlearned (usually with an
explicit notification process), and traffic then flows only to
the surviving router, as hinted here (I have added some emphasis
to critical phrases):
Split Multi-Link Trunking (SMLT), a Nortel Networks
extension to IEEE 802.3ad, improves on the level of Layer
2 resiliency by providing nodal protection in addition to
link failure protection and flexible bandwidth scaling. It
achieves this by allowing edge switches using IEEE 802.3ad
to dual home to two SMLT aggregation switches. It is
important to note that SMLT is transparent to attached devices
supporting IEEE 802.3ad.
Because SMLT inherently avoids loops due to its superior
enhanced-link aggregation-control-protocol, when designing
networks using SMLT, it is not necessary to use the IEEE
802.1D/w Spanning Tree protocols. Instead, a method is used
that allows two aggregation switches to appear as a single
device to edge switches that are dual homed to the aggregation
switches. The aggregation switches are interconnected
using an InterSwitch Trunk (IST), over which they exchange
addressing and state information, permitting rapid fault
detection and forwarding path modification. Although
SMLT is primarily designed for Layer 2, it also provides
benefits for Layer 3 networks.
This requires violating the spirit if not the letter of the
designs of both Ethernet and IP, but if everything works it
gives transparent load balancing and failover without any
changes to client devices, and to the manufacturer it also
delivers the advantage of locking customers into its
proprietary software and protocol to a high degree (because all
the switches involved, not just the two routers, must support
the extensions), even if the protocol has been
proposed for standardization.
What is the problem with violating the spirit or the letter
of existing designs? Reliability, usually, which depends on
simplicity and extensive
practice. One of my arguments about large bridged networks is
that routing exists for some good reasons, and routing software
(in simple situations) has been extensively tested. As to this,
I have seen a smart, experienced network manager attempt a few
times to implement a split trunking configuration in the field,
and fail for reasons that have proven hard to diagnose; even
attempting it required having equipment all from the same
supplier, and specific software versions (as, despite the
proposed standardization, different software versions from that
same manufacturer are apparently not quite compatible). Even if
it apparently works in several cases and in the lab...
- 070527c
Wildcard domain names update and spam avoidance scheme
- While reading about some amusing issues with the evolution of
the DNS I
discovered that there is a somewhat recent
RFC 4592 on DNS wildcards
which clears up some points about them. I
am interested in those because my
spam
avoidance scheme uses wildcards. The main points about spam are
that:
- It is essentially free for the sender but costly for the
receiver.
- It can only be sent if the receiver's address is
valid.
This suggests that filtering (even
when effective)
is sort of silly, as by the time a spam message is received the
cost of receiving it has already been paid, and apparently
85% of the messages
passing through most mail servers are spam. So one should
prevent the sending of a spam message rather than throwing
it away after receiving it.
But there are only two ways to prevent the sending of a spam
message: the first is not giving spammers the address, and the
second is retracting it after giving it. In practice only the
second way works, because e-mail addresses leak easily into
public space. The solution is therefore to use disposable,
easily retracted addresses. But retracting an address means
inconveniencing welcome as well as unwelcome senders of e-mail.
The solution to that is to have a different e-mail address per
sender, or at least per category of senders, so that if an e-mail
address is invalidated because it has become a spam target only
one or a few welcome senders are inconvenienced.
The question is then how to get numerous disposable e-mail
addresses, and some people just use several addresses registered
with free e-mail providers. But these addresses remain valid for
a long time, and just become spam-sinks: they do not prevent the
sending of the spam, they just shift the cost of receiving it to
someone else. There are some web based services that do
forwarding via periodically expired addresses, but these still
have two defects: the expiration is usually not under the control
of the receiver, and e-mail gets read by the service as it gets
forwarded by it.
One solution is to use a different domain per correspondent,
for example tony.mail.example.org
(for
people) or flickr.mbox.example.org
(for sites and companies), and then delete it if it gets spammed
and reissue the welcome senders with a new one. This is easiest
if one can run a DNS and SMTP server. But there is a better
solution that relies on the notion that most of these e-mail
addresses would not get spam, and explicitly adding one per
correspondent is too tedious.
So my solution is to run a DNS server and use wildcard
DNS records for e-mail addresses too. Then I make up
correspondent specific
domain names under the wildcard for private correspondence, but
I explicitly define a serially numbered one for places like
USENET News and web sites where a reply address can and should
be expired periodically, using domains like
news08.example.com. It is also quite
important to make sure that the main domain name cannot be used
as a mail target, by ensuring it does not have an MX
RR (and ideally no A
RR), and that any servers with
a published domain name do not have an SMTP server on them. The
wildcard scheme also requires an SMTP server that can accept
e-mail for wildcard domain names (for example
exim).
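As a hypothetical sketch (not a tested configuration), with Exim
the wildcarded subdomains can simply be listed among the domains
the server accepts mail for, using Exim's domain-list wildcard
patterns:
# '@' stands for the local host name; a leading '*' matches any
# name ending with the rest of the pattern, so this accepts mail
# for every subdomain of mail.example.org and mbox.example.org.
domainlist local_domains = @ : *.mail.example.org : *.mbox.example.org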
Now, if all domain names under the examples above are valid as
they are covered by the wildcard, how can one retract the few
that eventually get spam? Well, the rule is that a more
specific definition overrides the wildcard one, so adding any
RR definition for a spammed name (for example a TXT RR) is
enough to retract it. Therefore an example DNS zone file (in the
usual BIND syntax)
may look in part like:
$ORIGIN example.org.
@ NS NS0
@ NS NS1
NS0 A 10.0.0.2
NS1 A 10.0.0.3
WWW A 10.0.0.2
SMTP A 10.0.0.3
POP3 A 10.0.0.3
; Wildcards
*.mail MX 1 SMTP
*.mbox MX 1 SMTP
; Number changed periodically
me08 MX 1 SMTP
; Invalidated
phat.mbox TXT "spammed"
tony.mail TXT "spammed"
anne.mail TXT "spammed"
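Assuming the zone above is served by one's own name server, the
effect can be checked with dig (a sketch; the answers follow from
the zone fragment and the RFC 4592 rules):
# Covered by the *.mail wildcard, so an MX RR is synthesized:
dig +short MX anybody.mail.example.org
# tony.mail has its own (TXT) RR, which shuts off the wildcard,
# so the MX query returns nothing and mail to it is refused:
dig +short MX tony.mail.example.org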
- 070527b
When bridging or RAID5 are appropriate
- While I have manifested my
dislike of bridged networks
and amusement about RAID5
there are some special cases where they are applicable, and
usually in the small.
As to bridged networks I had some interesting discussions
with a rather smart person, and while we respect each other's
arguments, we found our difference boiled down to how large a
LAN should be: my limit being 30-50 nodes and a single switch,
and his being 500-1000 nodes and 10-20 switches. I still think
that an internet should be based on IP routing and not on
Ethernet bridging, and that the latter does not scale, but
sometimes it is expedient to use bridging as a
port multiplier
for a single LAN, up to a low
threshold.
As to RAID5 (in general parity RAID), there are only three
cases where it makes some sense:
- With exactly 3 discs, where RAID10 cannot apply: there is
some redundancy, and the speed penalty of a stripe of just 2
data discs is not large. However a RAID10 over 4 discs is still
much, much better for only a little more money.
- In the quite rare cases where applications write in
aligned, stripe sized units, the written data does have large
speed or persistency requirements, and the stripe size is
small. Having applications do aligned, stripe sized writes is
not terribly easy, because one must know the alignment to start
with, which can be obscured by partitioning, virtual volumes
etc., and then arrange the batching of writes. Using something
like the
sunit
and swidth
parameters of the
XFS file system
is pretty essential (for both data and the log/journal); a
sketch of setting them follows at the end of this entry.
- Another smart guy from
BNL
persuaded me that in cases similar to his RAID5 is an
acceptable compromise, for what is essentially an online
cache of a much larger offline archive:
- The data gets written once and read many times.
- The data is already fully backed up.
- The main purpose of the small degree of redundancy of
RAID5 is to minimize service disruptions when a single
disk fails in one of the many subsets of the data,
mostly as a convenience, as continued operation at
regular speed is not essential.
It is indeed possible to get
somewhat reasonable performance
out of an undamaged RAID5 storage system, if those conditions
are met or one accepts significantly lower write performance
than for RAID10.
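As promised in the second case above, this is roughly how the XFS
alignment parameters might be set for a hypothetical 4-disc RAID5
with a 64KiB chunk size, that is 3 data discs (the device name is
made up):
# sunit = 64KiB / 512B = 128 sectors; swidth = 3 data discs x 128 = 384.
mkfs.xfs -d sunit=128,swidth=384 /dev/md0
XFS also accepts sunit and swidth as mount options, which helps
if the array geometry changes later.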
- 070527
RAID5 and bridging/VLANs are not free lunches
- While discussing the popularity of RAID5 (and other types of
parity RAID) and of bridging/VLANs with some friends, one of them
who worked for a popular RAID product supplier said that he
thought that RAID5 is
salesman's RAID;
by this he meant
that RAID5 is the ideal sales pitch, as it promises both
redundancy and low cost, and of course excellent performance due
to inevitable proprietary enhancements. Similarly for VLANs,
which are presented as the ideal way to expand an installation
limitlessly and transparently in the most flexible way.
The pitches for RAID5 and VLANs are based on the premise
that difficult choices about difficult tradeoffs are not
necessary, and that anybody with more money than sense can
purchase a ready-made solution
that does not
require the effort to study issues like performance vs.
redundancy vs. cost or complicated stuff like routing.
- 070520b
Non uniform LCD monitor colour temperature
- Having recently purchased a second
LCD
monitor for my PC, an
LG L1752S
to complement my existing
LG L1710B,
I have realized, by comparing them and trying to set them up to
look visually similar, that in both the colour temperature
depends rather noticeably (once noticed) on the vertical
angle of vision.
Quite frustratingly the two have opposite gradients: the
L1752S looks bluer at the top and yellower at the bottom, and
the L1710B looks yellower at the top and bluer at the bottom.
Also the older one (L1710B) tends to warmer colours than the
other; this may be because of an older backlight, or perhaps
because the newer one may have an
LED backlight
(anyhow the newer monitor's backlight is also noticeably
brighter). Also, the colour temperature gradient on the L1752S is
rather more pronounced than on the L1710B.
Another thing that I have noticed is that, in the on-screen
menu for setting a custom colour temperature, settings
significantly different from the middle value (50) cause obvious
colour distortions; that is, the visual effect is non-linear and
strongly different for the three colours.
- 070520
Another RAID perversity
- From a reader of my notes, an amusing (not for him I guess)
tale of RAID misconfiguration:
[ ... ] an NFS server that has two 12 disk arrays
attached to it. The drives are 250GB SATA devices, so there is
a total of about 3TB of space in each array, 6TB raw space. On
boot, the devices and capacities are detected by the kernel,
which displays these two lines:
SCSI device sdb: 3866554368 512-byte hdwr sectors (1979676 MB)
SCSI device sdc: 3866554368 512-byte hdwr sectors (1979676 MB)
I start wondering where one third of the array went, and start
digging. The former admin mentioned that they were configured
as "Two RAID 5+0 array[s ...] with a usable storage
capacity of 3.6TB. Filesystems are striped across both arrays
using LVM2." Confused, I start running some calculations to
figure out how he managed to convert 6TB of space into 3.6TB
using RAID5.
It appears that there are actually four RAID5
arrays (two per physical array), each with one hot spare.
Within the RAID hardware, the two arrays are striped
(RAID5+0). The numbers mostly work for this configuration:
(12 drives - 2 for spares - 2 for RAID5) * 250GB = 2TB per array
LVM2 is then used to stripe the two RAID5+0 arrays (does that
make it RAID5+0+0?)
At least
LVM2
is used in striping mode instead of linear mode; by contrast I
have noticed that using LVM2 merely to span a logical volume
linearly across two physical discs is fairly popular.
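For comparison, this is roughly what the two allocation policies
look like with the LVM2 command line tools (a sketch with made-up
device and volume names, not the reader's actual configuration):
pvcreate /dev/sdb1 /dev/sdc1
vgcreate vg0 /dev/sdb1 /dev/sdc1
# Striped across both physical volumes with a 64KiB stripe size:
lvcreate -n data -i 2 -I 64 -L 3600G vg0
# The linear default just appends one physical volume after the other:
# lvcreate -n data -L 3600G vg0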
- 070513b
The missing links
- One inevitable disappointment with the
WWW is that on many
sites the text in many pages has no links (or very few), for
example most of the articles linked-to in my
previous notes. For commercial sites this
is easy to understand: they believe that revenue, mostly
obtained through advertising, depends on
stickiness
and thus any links reduce revenues. This is not as relevant for
non commercial web sites which often index their texts with
links, notably at Wikipedia. Another
problem is felt at many sites: laziness, which leads to treating
hypertext pages as if they were mere text pages (which results
in many commercial web sites not even having links between pages
in the same site, except for boilerplate navigation). But the
intentional lack of links when it happens is based on
the idea that the stickiest web site has a lot of links pointing
to it, but none from it to any other site (unless that site pays
or equivalent, like
link exchange).
Now look at it from a distance: if everybody does that, and
all sites are dead ends, the WWW ceases to exist, because it
becomes a collection of self contained collections of text
rather than hypertext. Then the only connective linkage between
sites becomes
search engines
and
directories,
as to link one simply copies some portion of the text and then
pastes it into a search engine form; indeed this is happening
and some users not only have search engine forms as their home
page, but even type
URLs
in those forms instead of in the browser's address field. So
more power to text search engines and directories, as the WWW
devolves from hypertext back to text. But the most important of
those is Google, and Google itself has
important financial impacts on sites, and again incoming links
are rewarded and outgoing ones are penalized, as
that directly relates to advertising and sales revenue. However
there is a catch here for
Google:
its text indexing scheme is based on links as well as on the text
that contains them. So Google, as well as advertisers, are in
effect giving significant financial incentives to sites to
undermine not just the link based nature of the WWW, but also
that of one of the most effective WWW indexing schemes, the one
that Google uses. Thus not just the avoidance of links within
supposed hypertext, especially those outgoing to other sites, but
also the establishment of large fake sites whose only purpose is
to provide large numbers of incoming links to those sites which
themselves have as few outgoing links as possible. It will be
interesting to see how this story evolves.
- 070513
PS3 very suitable for Folding@Home
- Interesting and unsurprising report that for some
embarrassingly parallel
applications like
Folding@Home
the PS3 delivers very high performance
running a fairly nice
background application.
That is not surprising, as each of the 6-7 secondary processing
elements (the vector processing elements) in a Cell Broadband
Engine architecture
(1,
2,
3,
4)
can deliver a lot of power, running at about 3GHz each, if the
problem data set fits in its 256KiB of local memory, which is
not too small.
- 070506c
On how to keep a job as a game programmer
- Entertaining advice from an article on how to keep a job as a
programmer, in the game industry:
"I know it's hard to keep focused on your current projects
and at same time watch the industry trends," Mencher says, "but
you must do this and ensure you're gaining the next generation
coding skills needed for the next killer job.
For programmers, it's never safe to stay focused on only
one technology or one platform. It is safe to keep learning
and increasing your skill base even if this means doing it off
work hours."
Well, I don't quite agree here: the single most important skill
for keeping a job as a programmer, not just in the games
industry, is to live in a low cost country, from Eastern Europe
to India or the Philippines. Those programmers without such a
skill will remain employed in the same way that naval engineers
remained employed in Glasgow. Anyhow the idea that game
employers discard those without the latest skills to hire those
who have them is just a symptom that there is a large oversupply
of programmers with recent skills.
- 070506b
Virtual machines as failures of existing kernels
- There have been a number of recent developments in virtual
machine software for x86 style systems; in itself nothing new, I
was managing a
VM/370
installation on an
IBM 3083
over a decade ago, and the IBM 3083 was nowhere near as powerful as a
contemporary desktop PC. But there is a common thread here:
virtual machines are being used now, as then, as a workaround,
because the underlying kernels or operating systems are too
limited or too buggy. Many years ago it was to work around the
limitations of
OS/360
and its successors, in one of the SVS, MVT, MVS incarnations:
SVS had a single address space, so you would run multiple SVS
instances in different virtual machines, MVT did not have
virtual memory, so you would run even a single instance in a
virtual machine to get around that, and MVS had poor support for
production and development environments, so one would run the
two environments in separate virtual machines. The same is
happening now, largely because of the same reasons; the one
inevitable reason to run virtual machines is the obvious one, to
run completely different operating systems, but let's look at
the other more common reasons for doing so:
- To consolidate multiple servers
- This is by far the most common rationale. Well,
server consolidation is an improper goal: why run
multiple virtual servers on a single real machine? What
makes more sense is service consolidation, and a
single server can well be used to consolidate multiple
services. The problem with this is that existing operating
systems make it awkward to merge configurations, to support
multiple ABI versions, etc., so that it is
sometimes easier to consolidate by just creating snapshot
images of existing servers and then running them in
parallel.
- To provide isolation between different services
- This is a safety and security issue, and it is the saddest
reason for using virtual machines, because service isolation
and containment are in effect the whole rationale for
operating system kernels. That it may be of benefit (and I
agree that in practice it might be) to add a second layer of
isolation and containment beneath the kernel is sad.
Admittedly this is already done in high security sites by
running separate hardware machines (on separate
networks in separate physical locations), but here we are
talking about ordinary sites whose service isolation and
containment needs should be satisfiable with a single
kernel.
- To run different execution environments
- This is another sad one. The idea here is that one might
want to run applications relying on different ABI levels.
But hey, isn't that something that can be solved easily by
installing different ABI levels concurrently and then using
things like
paths
to select the right one
for each service application (as in the sketch after this
list)? Well, if only it were that easy. There are famous
distributions with package and dependency managers that make
even just installing different versions of the same library
awkward. It can be easier to just create a wholly distinct
system image and instance just to work around such
limitations.
- To run different operational environments
- In this case one wants to have different environments, for
example for production, testing and development. There is
some point in running multiple system instances in virtual
machines here. The reason is that switchover between testing
and production, for example, depends on verification, a sharp
changeover and stability, and just switching images is indeed
the easiest way to do that, not least because tools to verify
the stability of an environment are difficult to write.
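As a purely hypothetical sketch of the path-based selection
mentioned above (the prefix and the application name are made
up), an older ABI level unpacked under its own prefix can be
selected per application with nothing more than environment
variables:
# Run one legacy application against the libraries and binaries
# installed under a separate prefix, leaving everything else on
# the system's current ABI level.
LD_LIBRARY_PATH=/opt/abi-2.3/lib \
PATH=/opt/abi-2.3/bin:$PATH \
  /opt/abi-2.3/bin/legacy-app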
Except for running wholly different operating systems sharing
the same hardware, I prefer, if at all necessary, weaker
virtualizations, like for example my favourite
Linux VServer,
which arguably should be part of the standard Linux kernel, as
it provides resource abstraction functionality that should
belong in every operating system. And never mind ancient but
lost-in-the-mists-of-time concepts like
transparent interposition.
- 070506
One of the first towns to get IPv6
- From BusinessWeek, an article about
a USA town switching to IPv6.
One can only hope that this is the start of a trend, as the
limitations of IPv4 are really starting to matter. Even if I
suspect that
NAT will still be used with IPv6,
it will then be only by choice, not because there aren't enough
addresses.
- 070505b
A recent RAID host adapter does RAID3
- While looking for an eSATA host adapter I was somewhat startled
to find that
XFX,
a company usually known for their graphics cards, has launched
the
Revo,
a line of SATA RAID host adapters which support
RAID3,
with models with 3 or 5 ports (reviewed for example
here).
It is not clear to me if it is really RAID3 or (more likely)
RAID4, that is whether striping is by byte(s) or by block. The
XFX product is based on a
chip by Netcell.
Fascinating though, as RAID3 has become less popular over time,
even if it makes
occasional unexplained reappearances.
- 070505
Copying a 500GB disk over eSATA
- Numbers from copying a whole 500GB disk from an internal SATA
drive to an external eSATA drive:
# ionice -c3 dd bs=16k if=/dev/sda of=/dev/sdb
30524161+1 records in
30524161+1 records out
500107862016 bytes (500 GB) copied, 10596.3 seconds, 47.2 MB/s
Cheap backup at 47MB/s (average over outer and inner tracks)
can't be much wrong, especially compared to 12-15MB/s for
Firewire 400 and 20-25MB/s for USB2. The only comparable
alternative is
Firewire 800
but my impression is that external cases and host adapters for
it are far more expensive than for eSATA.