Software and hardware annotations q3 2006

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

September 2006

060928c Intel researching chip with 80 FPUs
Well, I have mentioned 16-CPU chips before, and now Intel is reportedly looking instead at more specialized 80-FPU chips:
Today we were given a chance to see the first prototype wafer they will be using for production of 80-core processors. Each CPU like that will have 80 simple floating-point dies and each die is capable of teraflop performance and can transfer terabytes of data per second. They claim that it will be commercially available in a 5-year window and will be ideal for such tasks as real-time speech translation or massive search, for instance.
To some extent the ClearSpeed CSX600 is already like that, with 96 FPUs, each with around 6KiB of on-chip memory for operands. It has a somewhat different usage profile, and I suspect that for now it costs a lot more than what would be the target price for regular desktop usage.
060928b Fixing Belkin ADSL gateway bugs by using SMC firmware
I have long been suffering from the many terrible bugs in the firmware of my Belkin F5D7630 ADSL modem/router, and I finally decided to take advice found on some self-help forum and loaded onto it the firmware of the equivalent SMC 7804WBRA, which shares the same platform. Well, the SMC firmware, user interface and manual are way better, despite starting from the same base (the web user interface code is still written in a way that I find quite terrible).
Belkin and SMC both selected the same low cost supplier for the product, which is common. Then it looks like the Belkin buyers are paid bonuses on how much money they save, and piffling inanities like quality control cost money and reduce bonuses. I guess that somebody at SMC still cares about quality control, so the various functions were tested, the bits that did not work were fixed, and professional documentation was commissioned. I am going to buy SMC rather than Belkin in the future.
060928 Configuring 'automount' to use '/etc/fstab'
Having decided to use automount, I have realized that I sometimes still want to disable it and mount things manually and permanently. Unfortunately the format of automounter maps is different from that of /etc/fstab, so I would have to maintain the same information twice. Fortunately the automounter can synthesize map lines on the fly by invoking a script, and it is pretty easy to select the relevant /etc/fstab line (to which noauto should be added of course if necessary) and turn it into an automounter map entry dynamically, with a script like this:
#!/bin/sh

SP='[:space:]'

: '
  Copyright (C) 2006 PeterG. This program is free software: you
  can redistribute it and/or modify it under the terms of the
  GNU General Public License as published by the Free Software
  Foundation, either version 2 of the License, or (at your
  option) any later version. This program is distributed in the
  hope that it will be useful, but WITHOUT ANY WARRANTY; without
  even the implied warranty of MERCHANTABILITY or FITNESS FOR A
  PARTICULAR PURPOSE.

  We need to look up the key to find type, options and device to
  mount. This means we need the name of an "fstab" file and the
  prefix under which the key appears in it. These can be different
  depending on the "autofs" mount point, under which this script
  is run.
'
case "$PWD" in
*)	FSTAB='/etc/fstab'; FSPFX="$PWD/";; 
esac

KEY="$1"

grep '^['"$SP"']*[^\#'"$SP"']\+['"$SP"']\+'"$FSPFX$KEY"'/\?['"$SP"']' \
  "$FSTAB" | if read DEV DIR TYPE OPTS REST
    then
      case "$TYPE" in ?*)
	case "$OPTS" in
	'') OPTS="fstype=$TYPE";;
	?*) OPTS="fstype=$TYPE,$OPTS";;
        esac;;
      esac
      case "$DEV" in *':'*) :;; *) DEV=":$DEV";; esac

      : '
	A map program like this must print the map entry without the
	key, that is only the part after "key" in:
	  key  -[fstype=TYPE,][OPTION]*  [HOST]:RESOURCE
      '
      echo "-$OPTS $DEV"
    fi
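For instance, with a purely hypothetical /etc/fstab line like
/dev/sdb1   /media/backup   ext3   noauto,noatime   0 0
a lookup of the key backup under the /media mount point would make the script print
-fstype=ext3,noauto,noatime :/dev/sdb1
which is the map entry format that the automounter expects.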
and then running the automounter like this from /etc/inittab:
af:2345:wait:/usr/sbin/automount -t 300 /media program /etc/auto.fstab
By the way, too bad that automount is designed to always background itself, because if it did not, that line could have action respawn instead of wait.
I have extended the unmount timeout with -t 300 seconds instead of 60 because some server process I use intermittently accesses some directories on some mounted filesystems, which causes a large number of mounts and unmounts. This adds pointless lines to the log, and some filesystems like ext3 impose a periodic check after so many mounts. To avoid the latter I have actually disabled the forced check after a given number of mounts and retained only the one after a given number of days.
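For ext3 that amounts to something like the following (the device name is just a placeholder), where -c 0 disables the check after a maximum mount count and -i 30d keeps the check after an interval of 30 days:
tune2fs -c 0 -i 30d /dev/sdb1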
060924c Some reports on dual core versus hyperthreading performance
Spotted some interesting comments on the mailing list for the PostgreSQL DBMS about the price/performance of recent dual core AMD CPU chips and of SMT Intel CPU chips based on the older NetBurst microarchitecture. The first one definitely supports the AMD dual core CPU chips against the Intel Xeon ones:
I have been able to try out a dual-dual-core Opteron machine, and it flies.
In fact, it flies so well that we ordered one that day. So, in short £3k's worth of dual-opteron beat the living daylights out of our Xeon monster. I can't praise the Opteron enough, and I've always been a firm Intel pedant - the HyperTransport stuff must really be doing wonders. I typically see 500ms searches on it instead of 1000-2000ms on the Xeon)
which is not unexpected, while the second mentions that for something as highly parallel as a DBMS running many transactions, Intel's Hyper-Threading seemed to work pretty well too:
Actually, believe it or not, a coworker just saw HT double the performance of pgbench on his desktop machine.
Granted, not really a representative test case, but it still blew my mind. This was with a database that fit in his 1G of memory, and running windows XP. Both cases were newly minted pgbench databases with a scale of 40. Testing was 40 connections and 100 transactions. With HT he saw 47.6 TPS, without it was 21.1.
This is rather unexpected, as the author hints, because Hyper-Threading tends to result, even on custom coded applications, in something like 20-30% better performance (I did some work on that kind of custom coding).
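For reference, the pgbench run described above corresponds roughly to something like this (the database name is an assumption, and whether the 100 transactions were per client or in total is not stated):
pgbench -i -s 40 test
pgbench -c 40 -t 100 test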
060924b IBM to deliver hybrid Opteron and Cell cluster
I have just noticed that IBM is building a supercomputer as a cluster (all supercomputers today are clusters) of elements with a similar number of Opteron and Cell BE CPUs. Quite astonishingly the system will be built out of distinct cluster elements, with fairly impressive Opteron based System x3755 racks and BladeCenter H blade carriers, likely holding these Cell BE based blades.
This combination is a bit surprising for a major and a minor reason:
  • That the Opteron and the Cell BE will not be on the same board, but in entirely separate computers. I could have imagined using the Cell BE as a coprocessor for the Opteron.
  • That not only the Opteron and Cell elements will be distinct, but that the Opteron ones are based on a different physical format, even if there are pretty good looking Opteron based blades for the BladeCenter H infrastructure used for the Cell based systems.
The press release says that The machine is to be built entirely from commercially available hardware, which explains why there are no custom combined Opteron and Cell BE boards; but then the BladeCenter H is commercially available too. I suspect that with several thousand racks involved a common physical format is not that important.
Having a hybrid cluster with elements of two very different architectures is going to be a challenge, but IBM boasts that the issues will be solved with a new software infrastructure:
Roadrunner's construction will involve the creation of advanced "Hybrid Programming" software which will orchestrate the Cell B.E.-based system and AMD system and will inaugurate a new era of heterogeneous technology designs in supercomputing. These innovations, created collaboratively among IBM and LANL engineers will allow IBM to deploy mixed-technology systems to companies of all sizes, spanning industries such as life sciences, financial services, automotive and aerospace design.
which seems to be a remarkably optimistic statement.
060924 Web transaction test shows 64b slower than 32b
Some recently published benchmark report shows that running the same system with 32b Linux delivers more web transactions per minute with many clients than with 64b Linux. The advantage is not large but clear: the peak for both is with 128 clients, where 32b mode delivers 6,405 transactions per minute and 64b mode 4,932.
This is slightly surprising, because the CPU is an Opteron and these seem to perform a little better in 64b mode than in 32b mode, while Intel's Core and Pentium 4 with the AMD-compatible EM64T seem to perform noticeably better in 32b mode than in 64 bit mode.
By looking at the full graph reporting transactions per minute versus number of clients one sees that 32b and 64b mode are equivalent up to 64 clients, that the widest difference is with 128 clients, and that there is again no difference over 400 clients. Also, and crucially, the transactions per minute in 64b mode decline after 64 clients, while those of 32b mode are mostly constant at 128, 192 and 256 clients, after which they decline.
My impression is that this is due to memory being the bottleneck: the benchmark system has only 2GiB of main memory, and while many 64b applications tend to run a bit faster than 32b ones on an AMD chip, they take significantly more memory. Not quite a doubling, because in practice only pointers double in size, but still often significantly larger.
What it looks like from the graph is that in 32b mode maximum utilization of the CPU, IO and network is achieved starting with 128 clients, and memory then becomes the limiting factor above 256 clients; while in 64b mode memory limits performance already above 64 clients.
Perhaps the limit is reached because of paging, or instead because of higher memory traffic; I suspect to some extent the latter. The motherboard, a Tyan Thunder K8SR S2881, seems good, with 4 memory sockets per processor socket, but I wonder whether the test system had its 2GiB as 2x1GiB sticks (no dual channel per processor) or as 4x512MiB.
Another benchmark report on the same site illustrates the effect of different memory sizes on a similar workload, and it shows that transactions per minute drop precipitously rather than gently when the workload exceeds the size, rather than the bandwidth, of available memory. Since the 64b decline above is gradual, this would tend to suggest that the issue here is more memory bandwidth than memory size.
Curiously the graph of transactions per minute versus number of clients strongly resembles both the profile and absolute level of that for the Opteron benchmark in 64b mode, even if it is for Athlon 32b only CPUs. The motherboard used is the Tyan Thunder K7 S2462. Being designed for Athlon MP processors it has a much weaker memory subsystem: 4xDDR266 sockets instead of 8xDDR400, and chipset managed multiprocessing and memory controller rather than HyperTransport and a CPU integrated memory controller. Also note that the Athlon MP CPUs have significantly less cache than the Opterons.
My guess is that the S2462 in 32b mode and the S2881 in 64b mode both hit the maximum memory bandwidth at 64 clients, while the S2881 in 32b mode has more headroom (like 20-40%) and hits it at around 320 clients. Probably this means that AMD realized that Athlon 64s and Opterons in 64b mode needed more memory bandwidth than Athlons, and thus pushed the memory subsystem specs so that 64b mode would not be starved for memory access; which delivers the incidental benefit that the memory subsystem is in some sense oversized for 32b mode, and for memory bandwidth intensive benchmarks this more than compensates for the slightly slower 32b operations.
060920b Polipo, another nice proxy cache for small sites
While Apache with my patch is working well as a proxy cache, I also had a look at another one that seems especially promising:
Polipo is a small and fast caching web proxy (a web cache, an HTTP proxy, a proxy server) designed to be used by one person or a small group of people. I like to think that is similar in spirit to WWWOFFLE, but the implementation techniques are more like the ones used by Squid.
Polipo has some features that are, as far as I know, unique among currently available proxies:
  • Polipo will use HTTP/1.1 pipelining if it believes that the remote server supports it, whether the incoming requests are pipelined or come in simultaneously on multiple connections (this is more than the simple usage of persistent connections, which is done by e.g. Squid);
  • Polipo will cache the initial segment of an instance if the download has been interrupted, and, if necessary, complete it later using Range requests;
  • Polipo will upgrade client requests to HTTP/1.1 even if they come in as HTTP/1.0, and up- or downgrade server replies to the client's capabilities (this may involve conversion to or from the HTTP/1.1 chunked encoding);
  • Polipo has complete support for IPv6 (except for scoped (link-local) addresses).
  • Polipo can optionally use a technique known as Poor Man's Multiplexing to reduce latency even further.
In short, Polipo uses a plethora of techniques to make web browsing (seem) faster.
The attention to detail in the features listed above is notable. I am still using Apache as a proxy cache though, because I have to run it anyhow as a web server, and it seems to be adequate.
060920 Using Apache 2.2 as a proxy cache, with patch
For years I have been using mostly Squid as my web proxy cache, but also occasionally Apache, which has proxying and caching modules (they are distinct: Apache can proxy without caching and cache non-proxy requests, and both modules must be enabled to provide proxy caching like Squid).
The advantages of Squid are that it is rather more flexible, notably can filter URIs via a custom external program or script, and it scales to very large cache sizes. The main advantage of Apache as proxy cache is that one does not need to run an extra dæmon and it is simpler and smaller.
Anyhow I have been using Privoxy for a while for filtering (it can filter both URIs and the content associated with them, as these examples show), and there are secondary differences between Squid and Apache as proxy caches that matter: Squid does not support IPv6 (even if there are somewhat dodgy IPv6 patches available) while Apache does, and Squid periodically and fairly frequently runs a cache cleanup thread, while Apache has a separate cache cleanup program that can be run at any time. The latter is an advantage on laptops, where Squid's cleanup thread wakes up the disk too often.
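That separate program is htcacheclean, which on a laptop can simply be run from cron at a quiet time of day, with something like this /etc/crontab line (the binary path and the size limit are guesses for my setup; the cache path matches the CacheRoot used below):
30 4 * * *  root  /usr/sbin/htcacheclean -n -t -p/var/cache/httpd -l500M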
So I enabled forward proxy mode in my Apache 2.2 configuration with something like:
<IfModule mod_proxy.c>
  ProxyRequests         on
  Listen                *:3128

  NoProxy               127.0.0.1
  NoProxy               192.0.2.0/24
  NoProxy               .sabi.co.UK
  ProxyDomain           .sabi.co.UK
  ProxyVia              on
  ProxyTimeout          50

  <Proxy *>
    Satisfy             any
    Order               deny,allow
    Deny from           all
    Allow from          127.0.0.1
    Allow from          192.0.2.
  </Proxy>
</IfModule>
The supposed effect of the above is to allow proxying (on port 3128 too, but then Apache does both local service and proxy service on all ports); servers with local addresses are not proxied, and only clients with local addresses are allowed to proxy (unrestricted proxying is a very bad idea). Even without caching Apache proxying has value in a mixed IPv4 and IPv6 environment, because Apache has full support for both protocols, and it can proxy between them, so that IPv6-only clients can access IPv4-only servers and vice versa.
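Of course all of this assumes that the relevant modules have been loaded; on my build that means lines along these lines (the module paths are distribution dependent), plus the equivalent for mod_cache, mod_disk_cache and mod_mem_cache for the caching configuration below:
LoadModule proxy_module          modules/mod_proxy.so
LoadModule proxy_http_module     modules/mod_proxy_http.so
LoadModule proxy_connect_module  modules/mod_proxy_connect.so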
Then I enabled caching too with something like this:
<IfModule mod_cache.c>
  CacheDefaultExpire            3600
  CacheMaxExpire                86400
  CacheLastModifiedFactor       0.1
  CacheIgnoreNoLastMod          on
  CacheIgnoreCacheControl       off

  <IfModule mod_disk_cache.c>
    CacheEnable                 disk http://

    CacheRoot                   /var/cache/httpd
    CacheDirLength              1
    CacheDirLevels              2
    CacheMinFileSize            1
    CacheMaxFileSize            900000
  </IfModule>

  <IfModule mod_mem_cache.c>
    CacheEnable                 mem /

    MCacheSize                  10000
    MCacheMinObjectSize         1
    MCacheMaxObjectSize         200000
    MCacheMaxObjectCount        2000
  </IfModule>

  CacheDisable                  http://[::1]/
  CacheDisable                  http://127.0.0.1/
  CacheDisable                  http://localhost/
  CacheDisable                  http://ip6-localhost/
  CacheDisable                  http://example.com/
  CacheDisable                  http://.example.com/
</IfModule>
The supposed effect of the above is to allow disk caching of proxy requests involving HTTP URIs (FTP could also be cached, but I don't have a use for it) and memory caching for content served by Apache itself; and to permit no caching of any content on the local server or any server within my own domain.
Then I noticed that nothing was getting disk cached because of long standing issues with URI matching in the CacheEnable and CacheDisable directives:
  • The / pattern would match both local and proxy URIs.
  • No way to match all subdomains of a domain.
  • Anyhow, proxy URIs could otherwise only be matched by fully specifying the hostname part of the URI, that is http:// would match no URIs.
These issues were at variance with the relevant documentation or just did not make much sense. So I had a look, found that there was a known bug, decided to fix those issues myself, rewrote the relevant code, and submitted a patch for much improved and somewhat extended URI matching.
060917c IPv6 prefix non-portability and unique local addresses
In a blog entry on unique local IPv6 addresses there is an interesting but flawed argument for private address ranges in IPv6:
they are stable (because they don't depend on your ISP),
Here the argument is that in IPv6 all global unicast address ranges allocated to users are not portable, because while IPv4 routes each subnet independently, IPv6 only allows routing by aggregation, that is hierarchically. The argument above, expanded, looks like this:
  • If one gets an IPv6 prefix from an ISP, assigns addresses under it to nodes, and then changes ISP, one gets a different prefix, and has to change the address of all nodes.
  • If one uses from the beginning a private prefix, changes in ISP only require changes at the border gateways that remap the private prefix into the changing global one, not all nodes.
This argument is flawed because in the first case too, if the public prefix changes, it is quite possible to start mapping the old global prefix onto the new one at the border gateways. The difference is that in the second case one has to map the private prefix onto the public one at all times, just in case the public prefix changes in the future, while in the first case one needs to map the old public prefix onto the new one only if there has actually been a change.
Again the case for private prefixes is based on the desire of network managers to force all nodes to access the Internet via controlled gateways; this depends critically on those prefixes not being globally routable, rather than them being private or portable.
060917b Reintroducing private addresses, that is NAT, in IPv6
One of the advantages of IPv6 is that the abundance of potential addresses means that there is never a need to use private address ranges and thus to perform NAT, which is something with grave implications for networking. Unfortunately some network and system administrators have been clamoring for private addresses and NAT in IPv6, because they like the control and the modest degree of isolation, and thus partial security, that NAT gives in an IPv4 context. IPv6 used to have site local addresses, which however have been deprecated as ambiguous, because the problem then arises of what to do with nodes that are homed on multiple sites, along with the complications associated with zone identifier suffixes. The solution to the perceived demand for private IPv6 addresses has now become the comically named unique local unicast addresses, which raises two questions: if they are local, why do they need to be unique, and if they are unique, why should they be local?
My impression is that unique local addresses are a fudge to satisfy the demand for private address space. It is a fudge because the modest advantage of IPv4 private addresses and NAT is precisely that they are ambiguous, and thus require NAT. The benefits obtained by some network administrators from private address spaces are that:
  • Since IPv4 private addresses are ambiguous, and they are drawn from well known address ranges, even if datagrams carrying them leak onto the Internet they are not routable because no Internet router (hopefully) will accept routes for those well known address ranges, and most will simply drop packets to or from such addresses (which are listed in well known bogon list).
  • Since IPv4 private addresses are ambiguous and thus not routable on the Internet, a node with only a private address must use a NAT proxy to access the Internet, and vice versa, and this means that administrators can control and monitor all traffic from and to the Internet for nodes which only have an IPv4 private address.
In other words the side effect of private IPv4 address ranges is the ability to define separate, isolated internets (called intranets sometimes) with one way address translation to the global Internet.

Note: it is of course possible to create a separate intranet which reuses the full 32 bit address space of IPv4 instead of just the private address ranges. But while the latter only requires NAT one way (because the intranet and the Internet address sets then do not overlap, only the intranet addresses at different sites overlap), a full separate internet would require two way NAT, mapping in-use Internet addresses to a subset of the intranet's address space, as well as in-use intranet addresses to a subset of the Internet's address space, and addresses in the DNS protocol would have to be mapped too.

The reason why unicast site local addresses have been deprecated was not that they were ambiguous, but that they were ambiguous and without NAT: because if there had been NAT then nodes could not really belong to multiple sites ambiguously. The zone suffixes were an attempt to disambiguate the site local addresses, but using relative instead of absolute addressing, and as Robert Stroud's thesis showed, if absolute addresses work but do not scale, relative addresses scale but do not work.
The new proposal for unique local addresses turns mostly ambiguous relative addresses into mostly unambiguous global ones by inserting in each address a randomly generated site specific prefix, so that it is unlikely that two prefixes get the same random number, which is after all Robert Stroud's conclusion in his thesis:
The only alternative is to sacrifice a deterministic notion of identity by using random identifiers to approximate global uniqueness with a known probability of failure (which can be made arbitrarily small if the overall size of the system is known in advance).
But why bother for IPv6? Why not allocate a proper prefix? There is very little difference between that and a (partially) randomly generated one. The control (and weak security) of IPv4 private addresses is derived mostly from the necessity to use a NAT proxy to gain any access to the proper Internet, not from ambiguity in itself.
This need can be enforced by any site administrator even with globally unique addresses, simply by not publishing routes for the prefixes whose addresses should not be routable on the Internet either way. What about accidental routing table leaks? To help with this, a prefix, or another distinguishing feature of the address, could be allocated with the convention that Internet routers would not accept routes for its subprefixes and would drop any packets to any address under such a subprefix.
So for example the JA.net UK network have been given the subprefix 2001:0630:0050::/48, and they could have been given 2001:0630:0051::/48 as well, with the idea that they would never publish any routing table for the second prefix; or been given something like fc00:0630:0050::/48 as well, with the idea that no Internet router would route datagrams with addresses under fc00::/16. This would force any node with addresses under 2001:0630:0051::/48 or fc00:0630:0050::/48 prefixes to route via a proxy under 2001:0630:0050::/48 to communicate with the Internet, with the proxy having full flexibility on how to remap (NAT) either internal prefix to the externally routable one.
Of course this would have been possible with IPv4 too, and no need would have arisen for the IPv4 ambiguous private address space, except that it would have required the allocation of many more addresses. Private addresses in IPv4 have served mainly to conserve publicly routable address allocations, not to create isolated intranets, and given that IPv6 has no scarcity of address space, private addresses of any form are not needed to create isolated intranets in IPv6.
060917 Weird and not so weird long standing bugs
I was chatting with someone about software reliability and the free software development process, based on the social definition of works, and I gave some examples of long standing bugs in software that I use:
  • In XEmacs, often, after doing a regular expression based search and replace, the next search and replace does not match its target. If I repeat it identically it does.
  • In KDE's Konsole, after a while of use, the 8 key stops having any effect.
  • In Fedora the Yum dependency manager as a rule does not clean up after itself, even if one uses the clean all command, sometimes leaving hundreds of MiBs of data in its cache.
Why do these happen? Well, I guess because few people have encountered them before, or in the case of Yum, most hardly notice the issue. Why do they happen to me? Probably because my usage patterns are a bit different from the social average. My XEmacs configuration file is large, and I use Konsole with non default options, for example.
Why haven't I fixed them yet? Because they are not fatal errors, and I don't like to maintain my own patches, and submitting them to the official maintainers is often futile. Or even worse than futile if they contain design fixes and improvements, because they are perceived (entirely understandably) as criticism, resulting in very time consuming discussions.
060916 Large IPv6 datagram sizes and why they matter
In a previous entry I mentioned that one advantage of IPv6 is that it consolidates some useful extensions added as options to IPv4 over the years, most importantly large TCP window sizes, which are very useful for getting better TCP performance on high bandwidth, high latency links. But there is a bigger story than that: it adds an option to permit IPv6 datagrams, that is the MTU for TCP6 and UDP6, to be larger than 64KiB. This is important for very high speed links as such, not just for high bandwidth and high latency ones, whether regional, continental or intercontinental, for example 10gb/s links.
To fully utilize a 10gb/s link with 64KiB datagrams one must be able to route 20,000 packets per second (assuming that one datagram fits in a single packet, which will be the rule in IPv6), roughly 50 microseconds per 64KiB datagram, no matter what the latency is. That can be sort of expensive, because routers and network interfaces have two limits: the number of bytes per second they can shift, and the number of packets per second. Now the latter is often more of a limit than the former, because each packet requires expensive operations like interrupts, cache flushes and so on, so on very high bandwidth links it is useful to have large packet sizes.
The same issue arises on a smaller scale with the already common 1Gb/s networks and cheap cards and switches. Most cheap 1Gb/s cards and switches can only process Ethernet frames up to 1500B, which unfortunately does not allow actual transmission speeds anywhere near the 1Gb/s limit; one should use jumbo frames of 9000B instead. Note however that since jumbo frames are an Ethernet extension they don't work that well unless all the connected equipment supports them, while IPv6 jumbograms are part of the specification, so they should at least be handled (even if not supported, because there is no requirement to support packets larger than 1280B) by all IPv6 implementations.
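As a reminder, under GNU/Linux switching an interface to jumbo frames is just a matter of something like the following (the interface name is a placeholder, and every card and switch on the segment must support the larger frames, or things will break in confusing ways):
ip link set dev eth0 mtu 9000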
As discussed in RFC 1263, extensions are a somewhat dangerous activity: while they preserve backwards compatibility, they also introduce excessive variation and are prone to incomplete or poor implementation. RFC 1263 offered the design of new, updated protocols as an alternative. It turned out that this was impractical in the short term, but IPv6 demonstrates that it can work in the longer term.
Thus one obvious way of lowering the per-packet overhead costs is larger packets. There are two big problems with larger packets though: the first is that adaptive packet length configuration, called path MTU discovery, does not work well in IPv4 (largely because it was a later extension to IPv4 and many routers are misconfigured for it), and the other is that the IP datagram size in IPv4 is limited to 64KiB. These issues have become increasingly frustrating and are discussed forcefully at the large MTU advocacy site, in which MTU and datagram sizes of several hundred KiB are advocated for 10gb/s links, with a target of around 500 microseconds per packet, or 2,000 packets per second.
And this is where IPv6 has the advantage: it is the only widely available protocol that has large datagrams and path MTU discovery (as well as large window sizes and large UDP datagrams) built in and not as extensions, in large part, ironically, because IPv6 has been so long in coming.
060915 Thousands of queries per second for private addresses
NAT and private IPv4 addresses (for example those in the 10.0.0.0/8 range) are a crime, and an additional reason why is in this FAQ about the blackhole DNS servers:
The blackhole servers generally answer thousands of queries per second. In the past couple of years the number of queries to the blackhole servers has increased dramatically.
It is believed that the large majority of those queries occur because of "leakage" from intranets that are using the RFC 1918 private addresses. This can happen if the private intranet is internally using services that automatically do reverse queries, and the local DNS resolver needs to go outside the intranet to resolve these names.
For well-configured intranets, this shouldn't happen.
Usual problem here: the average network administrator will not waste his precious time figuring out how to set up reverse mappings for the private address range he uses, especially if someone else pays the cost. Also, this reminds me of an article on Internet root DNS servers reporting that one of the most common queries to them was for the top level domain WORKGROUP (never mind all those desktops with MS Windows 2000 or later trying to register themselves via DNS in the root servers). The lesson learned long ago by Michael Stonebraker about Ingres catalog indices:
Users are not always able to make crucial performance decisions correctly. For example, the INGRES system catalogs are accessed very frequently and in a predictable way.
There are clear instructions concerning how the system catalogs should be physically structured (they begin as heaps and should be hashed when their size becomes somewhat stable). Even so, some users fail to hash them appropriately.
Of course, the system continues to run; it just gets slower and slower. We have finally removed this particular decision from the user's domain entirely. It makes me a believer in automatic database design (e.g., [11]).
has been lost in time, like so many others.
060914b Game load times, fragmentation; reporting to base
Another fascinating quote from the previously mentioned interview with Valve is about poor performance in loading levels due to poor filesystem locality (which matters under GNU/Linux too):
For an example of that on the customer side, we want to improve performance. The engineers said: "Rather than guessing what's bottlenecking performance, let's go and measure what's actually going on." We instrumented all the Steam clients, and the answer was surprising. We thought that we should go and build a deferred level-loader, so that levels would swap in. It turned out that the real issue was that gamers' hard drives were really fragmented, and all of the technology we wanted wouldn't have made a difference, as we were spending all our time waiting on the disk-heads spinning round.
Note also the implication: installing Valve games (at least via their Steam service) means installing some software that does an exhaustive scan of the hard disk and reports its contents to Valve. This is most likely well advertised in the license.
060914 Episodic game content four times cheaper to develop
Noticed an interesting bit of an interview with the CEO of Valve, the developers of the Half-Life series, about cost of development of major game titles:
The solution that we're trying is to break things into smaller chunks and to do them more regularly. So far, it seems to be working. When we look at how long it took us to build a minute of gameplay for Half-Life 2, versus how many man-months it takes us to build a minute of gameplay for Episode One or Episode Two, we seem to be about four times as productive. But we'll go through all three episodes to see... We sort of made a commitment to do it three times and then assess.
Being four times more productive is a big thing. It reminds me of claims by Mark Rein of Epic that games based on their Unreal engine development tools only require about 15 developers. My impression is that the case they are making is not really about episodic content or development tools; it is that games that are really mods, reusing a lot of the engine and art of previous games, are much cheaper to develop than games developed from scratch. Which probably is possible because most platforms have sort of leveled in terms of functionality, so game engines don't have to be rewritten from scratch frequently, and thus can become somewhat stable platforms. Never mind middleware, whether third party libraries or tools: a game engine is in effect its own middleware, when used as a modding platform.
The business implications are tremendous too: Valve made a lot of money with Half-Life and Half-Life 2, which were both developed from scratch and retailed for around $40-50. Now they are releasing episodes that retail for around $20-30 and for which a major component of the development cost is a quarter of what it was. Developing mods can be very profitable, especially when they are released via Steam: Valve gets the whole price, instead of a fraction of it when sold at retail. No surprise that Valve is rather cautious when talking about Steam:
What is the split between sales of Episode One via Steam and boxed sales?
Gabe Newell: That isn't something that we've talked about. It's something we're keeping to ourselves.
So, how do you manage your relationship with EA when you're selling games via Steam?
Gabe Newell: Our relationship with EA is fine. I think that retailers are really frightened of these kinds of changes in the industry, and I think that we're learning stuff that is going to be very important for them. For example, Steam enables new ways of doing promotion: [ ... ] I think retailers are starting to understand that communicating more efficiently with customers is a way, not of taking money away from them, but of driving people into stores. It's not a way of cutting them out of the equation.
Fabulously disingenuous argument about promotion: the goal of online game delivery is not to promote games, for which one can do web sites and demos; it is to cut out the middleman as much as possible, and that is also why massive online games like World of Warcraft are so enormously profitable for game developers (and most are not available for consoles, where the console brand owners really insist on getting a large cut). If online content becomes even more popular then games publishers will be reduced to the role of marketing and PR agencies, not resellers, and the other role that they perform, project venture capitalists, will probably split off into independent entities.
Another point he makes is about how dominant World of Warcraft is in business terms:
Right now, I think the benchmark game in the industry is World of Warcraft, and every platform could be measured against its ability to give advantage, or fail to give advantage, to building a better World of Warcraft.
a notion that I have previously discussed, as that game probably explains a lot of the falling sales of other PC games. As to that, I suspect that if World of Warcraft were available on consoles, the sales of other console games would be impacted too.
060913b Socket duty cycles matter
I am quite pleased to see that the designers of FireWire considered the important mechanical problems in plugging and unplugging cables, something that not everybody understands is a significant issue (from Michael J Teener):
One of the primary features is that the moving parts were all in the *plug* (all connectors have moving parts ... those little springy bits that apply pressure and make sure the connection is good and tight). This way the part of the connector/socket system that wears out is in the cable. When something goes wrong (as it will in any mechanical system), you throw out the cheap, easily replaceable component ... i.e., the cable.
I was talking about this again recently in the context of a large scientific facility, which expects a lot of visiting scholars for relatively short periods of time. My argument was that for the user facing part of the network a wireless network is a good idea (security can be sorted out), simply because it avoids a lot of plugging and unplugging of cables by hasty and not very careful people, which is especially damaging as the number of insertion cycles for which many sockets are designed is pretty low.
In general an important difference in the quality of connectors is their designed insertion cycle limit, whether for USB, RAM, CPU, PCI or AGP sockets. Low quality connectors can break after only half a dozen insertions. It is one of those aspects of quality that only careful buyers consider. Some people think that one is not going to do that many insertions, for example into RAM sockets; but tracking down a defective RAM stick can change that calculation quite a bit, and so can upgrades.
060913 Astonishing FireWire security vulnerability
I was reading the Wikipedia article on FireWire (also known as IEEE 1394 and i.Link) and I was astonished to belatedly learn that most FireWire host adapters have a large security issue: by default they allow any device to read or write any address of main memory. This does not seem grave, because after all it is usually the kernel that sends commands to the device instructing it where to read and write memory, in other words operations are initiated by the kernel. However, FireWire is essentially a peer-to-peer system, where operations may be initiated by any connected device; indeed it can be used to connect two (or more) systems directly, as if it were a network link. In which case they have full access to each other's memory, which is a bit too loose.
It is also amusing that this little issue has been proactively turned into a special case advantage: a FireWire link can be used as a memory debugging tool:
This feature can also be used to debug a machine whose operating system has crashed, and in some systems for remote-console operations. On FreeBSD, the dcons driver provides both, with using gdb as debugger. Under Linux, firescope and fireproxy exist.
These are classic examples of the if life gives you lemons, make lemonade principle.
FireWire still remains vastly preferable to USB2, as it is a much better defined protocol with much more reliable implementations (notably those from Oxford Semiconductor), though there are also arguably very bad ones, for example the notorious Prolific PL-3507. It is a pity that Apple decided to start asking for licensing fees on FireWire, triggering others to do the same, which pushed Intel and others to design and adopt USB2.
060911 IPv6 options for IPv4 interoperability
Talking again about IPv6, what are the options for IPv6 connectivity? Well, many, but most are obsolete or not very practical. IPv6 adoption is ever rising (slide 6), but mostly for corporate or government use. Overall the practical methods of getting IPv6 connectivity for home users are:
  • Direct IPv6 support by the ISP: Some ISPs support it directly, for example in the UK there are Andrew & Arnold at a retail level, Bytemark for web hosting, and UK6X (an offshoot of BT) at a wholesale level. ISP direct support in theory allows configuring an IPv6 only system, but since most servers are still IPv4 only, it is necessary to configure a dual stack as those ISPs that do support IPv6 rarely also provide an IPv4-to-IPv6 proxy.
  • IPv6-in-IPv4 using protocol 41: this can use the convenient 2002::/16 (6to4) prefix, which embeds the IPv4 address, and then it is usually autorouted, or it can be explicitly configured and routed via a tunnel broker. Except in rare cases it is not supported by consumer grade ADSL/cable gateways, but it can be used if one connects to the Internet via a simple point-to-point connection over a modem (either a phone or an ADSL one). Autorouting usually results in poor performance (see the sketch at the end of this entry), so registering with a 6in4 tunnel broker is usually better.
  • IPv6-in-UDP (or TCP or SCTP, but UDP is much better), which means using the Teredo protocol, supported natively by MS Windows and by Miredo under GNU/Linux, or the AYIYA scheme, supported by AICCU. These schemes rely on the fact that UDP must work for Internet access to happen at all. The overhead of encapsulating within both UDP and IPv4 is regrettable, but it buys simplicity.
Thus for most home users Teredo or AYIYA will be the only choice; for those using direct modem connections or a webhosting ISP, protocol 41 encapsulation will be easiest, usually via a tunnel broker, though sometimes autorouted 6to4 is good too. The very lucky ones have ISPs and web hosts that support IPv6 directly, and an ADSL/cable gateway that routes IPv6 too.
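As a rough sketch of the autorouted (6to4) variant under GNU/Linux, assuming a placeholder public IPv4 address of 192.0.2.1 (whose 6to4 prefix is 2002:c000:201::/48) and the well known 6to4 anycast relay at 192.88.99.1:
ip tunnel add tun6to4 mode sit remote any local 192.0.2.1 ttl 64
ip link set dev tun6to4 up
ip -6 addr add 2002:c000:201::1/16 dev tun6to4
ip -6 route add 2000::/3 via ::192.88.99.1 dev tun6to4
A tunnel broker setup looks much the same, except that remote any becomes the broker's IPv4 endpoint and the addresses and routes are those the broker assigns.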
060910b QoS shaping and a nice paper with a strange setup
As the author of sabishape I was interested to find a nice detailed discussion about shaping and QoS. The author makes some useful points, for example about jitter limitation requiring a far higher rate of examination on low bandwidth links, and it is somewhat surprising how high a rate is needed:
The only timer source of use at SME bandwidths are the high performance timers available in post-PentiumPro CPUs. These allow bandwidth estimation and policying at speeds lower than 64K on FreeBSD (with the HZ raised beyond 2KHz) and lower than 128K on Linux (at 1KHz HZ).
and that high overhead, interrupt-per-packet cards are better for that purpose:
At the same time, cards that are considered "horrid" like Realtek 8139 (rl) provide much more interrupts and much more scheduling opportunities. As a result they provide considerably better estimator precision and policy performance. The difference is especially obvious on speeds under 2MBit. It is nearly impossible to achieve a working ALTQ hierarchy where some classes are in the sub-64Kbit range for a 2MB (E1) using Intel EtherExpress Pro (fxp). It is hard, but possible on a Tulip (dc). It is trivial to do this using Realtek (rl).
Linux estimator entry points differ from BSD. As a result, the effects of hardware are less pronounced and system is more dependant on the precision of the timer source. Still, the same rules are valid for Linux as well. QoS on low bandwidth links (sub-2MB) cannot be performed on server class network hardware.
The reason is that if we want to be sure that the maximum bandwidth used is under a certain level, the more variable it is, the lower the average must be to ensure the limit is not crossed. The overall discussion is quite agreeable, even if I think that:
In reality the diagram is likely to contain 16-20 classes for an average company or 5-10 classes for an average home office network.
exaggerates a bit, because really one can do with one class per source of traffic (e.g. per sharer of the link) and three classes per type of traffic, like low, medium and high. More classes just make the situation more complex and don't really work that well, especially if there is little bandwidth to share.
Then I was a bit surprised by this point:
HTB is not a good choice for an SME or hobby network. Bandwidth will not be utilised fully and the link efficiency is considerably worse. Its only advantage is that its "ease of understanding" and "predictability" are easier to express in a subletting agreement.
HTB is a looser policy than CBQ (both are also described in this Linux specific HOWTO) but it should not necessarily lead to lower utilization, if one specifies the ceiling for each class as the limit for the whole link.
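For example, in a hypothetical HTB setup like the following (interface, rates and class layout are just placeholders) each class is guaranteed its own rate but is allowed to borrow up to the ceiling of the whole 2mbit link, so no bandwidth need be left unused:
tc qdisc add dev eth0 root handle 1: htb default 20
tc class add dev eth0 parent 1: classid 1:1 htb rate 2mbit ceil 2mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 1mbit ceil 2mbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 1mbit ceil 2mbit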
I also found the sample configuration a bit odd because it only shapes incoming traffic, where shaping is really necessary and yet not so effective, and not outgoing traffic, where it is effective; in part because the assumption is that no traffic goes to the host doing the shaping itself, only to hosts behind it. But it is still quite surprising, because incoming traffic cannot really be shaped by queueing, only by dropping packets.
Shaping incoming traffic by queueing it at the time of forwarding it on only has the effect of creating large queues on the shaping host, which eventually fill up and cause packets to be dropped, but likely in a less favourable pattern than if the dropping is done regularly by the ingress discipline. The goal of dropping incoming packets is in effect to simulate a congested link and thus trigger quenching at the source; dropping packets regularly, as the ingress discipline does, simulates a constant, steady bandwidth limit, whereas letting incoming packets queue up and then dropping them as the queue becomes full simulates a full bandwidth link that occasionally becomes very congested, and congestion control algorithms don't react well to that. I'll ask the author of the site about his reasons for that choice.
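For reference, the regular dropping just described is what the Linux ingress policer does; a minimal sketch (interface and rate are placeholders) that drops incoming traffic on eth0 above 2mbit at the ingress hook looks like:
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 police rate 2mbit burst 20k drop flowid :1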
060910 Using IPv6 for a large site
I was recently discussing with some interesting people the use of IPv6 for a large scientific site with a vast internal network infrastructure and both internal and external users. The main advantages of IPv6 are:
  • Lots of addresses. Never do NAT.
  • Better autoconfiguration, especially nice for temporary and mobile clients.
  • Better performance over high bandwidth and/or high latency links (jumbograms, incorporates the common TCP extensions).
  • Fairly easy to convert applications to use it, alone or with IPv4, by using getaddrinfo.
  • Cheaper to process in high traffic routers.
The main disadvantages are:
  • It is completely incompatible with IPv4, which means that various encapsulation options are needed to pass IPv6 traffic through IPv4-only networks.
  • Its security architecture is based on IPSEC which is complicated.
  • Minimum packet length is somewhat longer.
  • Addresses are quite long to write.
  • Having globally unique addresses without NAT makes it much easier to track usage by PC, which often means by user.
  • The dynamic autoconfiguration abilities are sort of pointless if one has to configure the same host for both IPv6 and IPv4 operation, because then the IPv4 must be configured explicitly anyhow.
There are also a couple of serendipitous security advantages:
  • It is possible to assign IPv6 addresses where the lower 64 bits are randomly generated, thus creating a fairly strong measure of security via sparse capabilities against network scanning attacks (see the sketch after this list).
  • More weakly, given the relative scarcity of IPv6 targets, security attacks against IPv6-only hosts are rather more unlikely than against IPv4 hosts.
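As a sketch of the first point (the prefix, the interface and the use of /dev/urandom are all just illustrative assumptions), an address with random lower 64 bits can be assigned with something like:
SUFFIX="$(od -An -N8 -tx2 /dev/urandom | tr -s ' ' ':' | sed 's/^://')"
ip -6 addr add "2001:db8:1:2:$SUFFIX/64" dev eth0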
Overall IPv6 is a win, except for the lack of connectivity at ISP level. There is also a peculiar advantage: MS Windows Vista is by default dual IPv6 and IPv4, and IPv6 connections are attempted first if an IPv6 address can be discovered for the target. This will cause some delay if there is no IPv6 connectivity at the source. So better to have it anyhow.
There are a few distinct IPv6 based deployment scenarios (some discussed in RFC 4057), and at the time my opinion was that, especially in a large scientific environment, one might as well have all servers, clients and services on both IPv6 and IPv4. Well, on reflection it would be better to have all internal servers (and clients) be IPv6-only, and have dual IPv6 and IPv4 only on externally visible hosts, because IPv6-only adds the extra serendipitous security described above, is feasible, and only requires one configuration.

August 2006

060828 A new 'init' design from Ubuntu
Having recently discussed the init dæmon, it is pleasing to see that the Ubuntu project is replacing the standard System V style one with a redesigned one called upstart, for which they have written a neat paper describing it and comparing it with similar projects like OpenSolaris SMF, initng or Apple's launchd.
My first impression is that most of these are a bit misguided, because they attempt to unify too wide a notion of service management, merging the functionalities of init, cron and inetd (and more). Never mind that launchd uses XML for its configuration files...
This is quite incorrect, especially as to including the inetd functionality, because it has been traditional to allow a UNIX like system to start up without any networking, for various reasons.
Then there is the non trivial problem that there is essentially no precedent in UNIX like systems for service dæmon management: traditional init (almost) just runs enough scripts to reach some state of readiness at startup and to go back down at shutdown, cron just runs commands, not dæmons, and inetd really manages sockets, not services.
However something good may come out of this mess, even if I detect signs of bad news, for example for upstart:
In fact, any process on the system may send events to the init daemon over its control socket (subject to security restrictions, of course) so there is no limit.
which seems to indicate that it is yet another mess like udev. And it was pretty scary to read that some people wanted to integrate upstart with D-Bus, another mess. Overall I think initng is the more UNIX like solution, simple and dependency based, but event based upstart may not be too bad, even if it has already been decided that it will use /etc/event.d as its configuration directory (UNIX style is not to have a .d suffix on directories). Let's hope that the socket is actually a named pipe, at least.
060825 Much better performance with direct HyperTransport interface
Quite impressive numbers in an article about network messaging comparing a cool InfiniPath HTX card with the same card with a PCI-E interface; performance with the direct HyperTransport interface is several times better in both throughput (10 times more messages per second) and latency (3 times lower) than with a PCI-E 8-lane interface. It is also interesting that HyperTransport now comes in a PCI-style slot design called HTX as well as the more familiar AMD style CPU socket interfaces.
Unlike MIPS style CPUs, AMD and Intel CPUs no longer have a coprocessor interface, but AMD CPUs have this great HyperTransport interface that is good enough (in throughput and latency) to support inter-CPU communications in a multiprocessor, and it can also be used by non-CPU chips as a general system bus, which allows all sorts of tightly coupled coprocessors, network oriented or otherwise. Many years ago, for example, Weitek had a line of extra-performance x87 socket compatible floating point coprocessors. One can now imagine putting all sorts of things like that into HyperTransport capable, or more probably HTX capable, AMD64 motherboards.
060822 AMD hopes for large market share in servers
One of AMD's vice-presidents has stated that he expects AMD to reach 40% market share in servers by 2009, which sounds fairly plausible. If AMD want to grow, they have to do that: laptop sales currently outnumber desktop and server sales in most countries, and Intel's overwhelming market share in laptops seems unassailable, especially after the release of the Core 2 series. In servers AMD may have the advantage, because the on-chip HyperTransport links of its Direct Connect Architecture give it a considerable advantage in 4-chip and 8-chip systems, which therefore do not require the SMP chipsets that Intel CPUs need.
Part of the story is that until relatively recently Intel's market strategy for servers was based on the Itanium architecture, whose lack of success has given a large opportunity to AMD. Conversely laptops have traditionally been mostly purchased by corporations, and corporations tend to prefer Intel over AMD chips, simply out of retentive habits.
It could well happen that Intel will end up dominating the laptop market, AMD the server market, and both will continue to sell in the shrinking desktop market, with the advantage to Intel as usual, mostly thanks to their new focus on low cost, integrated graphics, systems.
060821c The 32-bit Linux real and virtual memory boundaries
This is a little detail that should probably be discussed a bit more widely. The Linux kernel usually maps into each process address space both the virtual memory for that process and the whole of the system's real memory, as well as a work area for itself. This means that on CPUs with 32 bit addressing the sum of per-process virtual memory, real memory and kernel work area cannot exceed 4GiB.
The default kernel work area is 128MiB, and the default per process virtual memory space is 3GiB, so the default real memory mapping window is 896MiB. What happens if the system has more than 896MiB? Well, either the excess gets ignored, or high memory support is enabled, and then real memory above 896MiB is mapped temporarily into a subwindow of the 128MiB per-process kernel memory area, which involves a modest slowdown and some complications.
My reckoning is that up to a point it is better to map more real memory and have less per-process address space. The boundary between the two can be changed by defining the CONFIG_PAGE_OFFSET setting in the .config file when building the kernel, or for older kernels by redefining the macro __PAGE_OFFSET in the kernel header include/asm-i386/page.h before building it. The default is 0xC0000000, and the values that I think are sensible are:
Kernel process/real memory space boundaries:

  Boundary     Process space             Kernel space   Real memory window
  0xB8000000   2944MiB (3GiB-128MiB)     128MiB         1GiB
  0xC0000000   3GiB                      128MiB         896MiB (1GiB-128MiB)
  0x98000000   2432MiB (2.5GiB-128MiB)   128MiB         1.5GiB
  0x78000000   1920MiB (2GiB-128MiB)     128MiB         2GiB
  0x38000000   896MiB (1GiB-128MiB)      128MiB         3GiB
Of these I think that the most useful is 0x78000000, where the real memory map window is 2GiB and the per process address space is just under that, as this allows direct mapping of most desktop real memory sizes, and the 1920MiB per process address space is still pretty large, and it still allows the full amount of real memory to be used up by a single process.
The least useful is the default 0xC0000000, because unless high memory is enabled it wastes 128MiB of the by now common 1GiB real memory endowment, for the dubious benefit of a 3GiB per process address space. More useful would have been 0xB8000000, as that at least allows full mapping of the 1GiB, at the insignificant cost of 128MiB less per process address space.
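For the record, choosing the 0x78000000 boundary then means a .config line like the following (whether the value can be set freely or only through the VMSPLIT configuration choices depends on the kernel version):
CONFIG_PAGE_OFFSET=0x78000000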
060821b Another chip with 16 CPUs
Having just seen an interesting 16-CPU chip, it is not that surprising to discover that Boston Circuits are planning another 16-CPU chip, with a hardware queue manager and some special hardware IPC:
BCI's gCORE processors employ a new Grid on Chip architecture which arranges system elements such as processor cores, memory, and peripherals on an internal "grid" network. gCORE is one of the first commercial applications of "network on chip" technology, that has been at the forefront of research at leading universities in recent years. It is widely accepted that traditional bus architectures are no longer valid for large scale system on chip implementations in the sub 90 nanometer geometries. Traditional buses become too large, and too slow to support 16 or more processor cores. Ease of use has been one of the biggest obstacles for the widespread adoption of multi-core processors. BCI has taken an unique approach of incorporating a "Time Machine" module in the chip to dynamically assign tasks to each of the processor cores. By alleviating the need to explicitly program each core, this approach greatly simplifies the software development process.
This chip is also designed to run Linux. Not surprising: just as the ready and cheap availability of UNIX source licenses considerably reduced initial cost to develop new minicomputers and workstation systems in the 1980s, Linux has done the same for small servers and PCs and even desktop boxes and ADSL routers in the 1990s.
060821 15 years of Linux
Linux® was famously announced 15 years ago as a hobby project. As discussed by Linus Torvalds and Red Herring in this interview it is also now a big business, especially for startups. I have been using it almost as long, since I was doing my degree.
In the computer industry 15 years is a long time, and it is even more remarkable that Linux is a compatible reimplementation of the UNIX® kernel, which has been around for over 30 years. But August 2006 is not only the 15th anniversary of Linux, but also the 1st anniversary of the closure of the UNIX department at AT&T. Indeed my previous OS was Dell's UNIX System V.4, and while it was not bad, GNU/Linux was rather more flexible, as well as free to modify and enhance, which is a big advantage. Perhaps Plan 9 should have become popular in its place, but at the time its license did not allow it. I was also fond, perhaps fonder, of FreeBSD, but at the time it was the target of a license case from AT&T; anyhow its Berkeley-style license does not reward contributions as much as the GPL, which means that I have easily resisted the temptation to switch to a *BSD distribution, even if I think they are technically more elegant (except for their package managers).
There have been several other UNIX kernel compatible reimplementations, and several other free kernels some of them quite unlike UNIX, but interesting. But for now and the near future it is going to be Linux (and a few other UNIX compatible reimplementations) for most of those who choose a Microsoft alternative.
060815 A chip with 16 CPUs
Given my long standing interest in parallel and vector processors, I was delighted to see an article on Movidis, a company selling a server based on a processor with an unusual set of tradeoffs: while the clock speed is limited to around 600MHz, the chip has 16 processors and a total power consumption of around 30W. If under load the 16 processors delivered processing power at, say, 50% efficiency, and thus 8 times the power of a single processor, the chip would then have performance equivalent to a 5GHz processor, which is pretty remarkable.
The CPU is the OCTEON CN3860 which seems to have been designed by networking company Cavium for multiple-packet-flow handling networking appliances.
With a few others I had expected processor chips to evolve in two different branches as silicon budgets have gone way up in the past 10-15 years: one branch using the budget for ever more complicated single-CPU chips with bigger caches, for backwards compatibility, and another branch using all those transistors to put a whole multiple-CPU system on a chip.
But multiple-CPU chips have only started happening relatively recently, and they have been about a few complicated CPUs on a chip, not many simpler ones. The OCTEON family is one of the few chips with a multiple-CPU architecture that goes for numbers. The others I have seen so far are Sun's 8-CPU UltraSPARC T1 and the rather more specialized 24-CPU (soon to become 48) Vega from Azul Systems.
What can a MIPS-architecture based 64 bit general purpose chip be best used for? Well, to process multiple streams of network packets, as Cavium does, or of media, as Movidis initially positioned it for. As to the latter I was a bit perplexed, because I designed and delivered a streaming video media server years ago with standard PC parts and it did not really need a lot of CPU power: the first instance could easily cope with 20 simultaneous MPEG1 streams with a Pentium 100MHz chip. Media servers seem to me more disk bound than network bound, and not much CPU bound.
But as a general purpose processor it is really quite good for workloads with many users, and that for example means web servers. Which suits the somewhat network-oriented tradeoffs in the CN3860, as the 16 CPUs on this chip have no floating point coprocessor, and only 32KiB of level 1 instruction cache and 8KiB of level 1 data cache on them, and 1MiB of shared level 2 cache, which might be equivalent to 128KiB per CPU (but probably more).
For example currently a number of web hosting companies already use various software based virtualization techniques like Linux VServer, UML or Xen to provide 8-16 virtual partitions per hosting PC, and with an OCTEON (or UltraSPARC-T1) CPU they can provide a real CPU per user, which solves a number of issues, and for a lot less watts than an equivalent 3-5GHz x86 style single-CPU chip.
Another good use of such a chip is for build systems, even if compilers would use more cache than that provided on the CN3860. Lack of floating point of course makes it wholly unsuitable for most scientific processing tasks, but another suitable application area is image and data processing and compression, as a lot of relevant algorithms either do not require floating point or can be reformulated in fixed point, and the CN3860 has plenty of precision for that as its CPUs are fully 64 bit.
For multiprocessor scientific computing in-a-box there used to be Orion MultiSystems (cached home page), which sold boxes with 96 single-CPU, low wattage Transmeta Efficeon chips, which however were not that good at floating point.
060810b Finer tagging of text in HTML and the semantic web
Having just written against the way microformats are used to tag data for automatic processing (as the Semantic Web is not quite here yet), I feel like confessing that I use the sensible alternative to microformats, which is finer resolution text tagging. The reason for this is that often there is a need both to put in evidence some parts of speech and to point out types of discourse (usually in the special case, but not only, of levels of discourse, where the type is the degree of abstraction).
A part of speech is a word with a specific semantic role, for example a verb or a preposition. Most parts of speech do not need to be put in evidence because their role is obvious and implicit in the language. HTML already has some tags to indicate some parts of speech, for example abbr and acronym.
Often, however, the parts of speech that are not obvious are proper names, because some proper names are not easy to distinguish from ordinary words (many are derived from them). In many languages there is some convention to indicate a proper name, and the one I use is to capitalize the first letter of each of its words. I also reckon that the cite tag of HTML is the neatest way of tagging proper names in general. So I would write:
<cite>Smith</cite> is an engineer, not a smith.
Then however there are different types of proper names, and I wish often to be able to differentiate them. This is particularly valuable in talking about computer and business related matters, because often different entities, like companies and their products, have the same proper name; and many companies and products have all-lower case or mixed case names, and first letter capitalization cannot be used to indicate a proper name role for the word. For this I use not just the cite tag but also the class attribute to indicate what type of proper name is being tagged, and then a slightly different CSS rule to give them slightly different renderings. For example, I would write:
<cite class="corp">Oracle</cite>'s main product is
<cite class="thing">Oracle</cite>, and only an oracle can
predict whether they will release a similarly named Linux distribution.
because Oracle is the name of a corporate person and Oracle that of a thing. I have tried not to define too many classes of proper names, and currently the values I use for the class of a cite element are corp for corporate persons, thing for objects, place for locations, and uom for units of measure.
A bit less consistently I also use cite to tag bibliographic citations, not just citations of proper names, thus I also have classes author for proper names of authors, title for names of the article or book, part for names of the specific part of a book, which for the name of the issue, publ for publisher names, and date for the date of publication.
As to types of discourse, it often happens that different abstraction and speech levels are mixed, for example in quotations. HTML already has some generic tags for types of discourse: blockquote for large textual quotations, q for smaller quotations, code for quoting code, and a few others.
I tend to use different classes of q to indicate the type of discourse of single words or longer sequences, for example fl for foreign language, nlq for non-literal quotes, and toa for terms of art.
So for example I would write:
A <q class="toa">type of discourse</q> can be recognized
because it must be read in a non-<q class="nlq">plain</q> way
to get the correct meaning: in <cite class="thing">French</cite>
the word <q class="fl">chair</q> does not mean
<q>chair</q>, but <q>flesh</q>.
Apart from being useful to remove ambiguities, finer pseudo-tagging of text has another advantage: it becomes a lot easier to search for stuff in the text, for example using tools like sgrep, a version of grep which has matching operators specifically for SGML style syntax.
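For example, a crude sketch with plain grep and sed (sgrep can do the same more robustly with structure-aware operators); it assumes each tagged element fits on one line, and page.html is just a placeholder name:
# List every corporate-person name tagged as above in a page,
# stripping the markup and removing duplicates.
grep -o '<cite class="corp">[^<]*</cite>' page.html \
| sed -e 's/<[^>]*>//g' | sort -u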
060810 Fonts, antialiasing and low DPI
While trying to help some of the usual people lost in the GNU/Linux fonts mess I pointed out that antialiasing is usually not a good idea for bitmap and well hinted outline fonts (which also excludes those designed to rely on subpixel antialiasing), because it blurs the character boundaries and makes focusing too hard, while the shapes of the characters are already pretty good. But I have to admit that there might be an exception, and I have been impressed by how increasingly common it is: if the display only supports low resolutions like 72DPI or 75DPI then antialiasing might have some merit, as at those resolutions glyphs are really rather pixelated. I always try to make sure that any monitor I use is at 96DPI or 100DPI or preferably even 120DPI, and starting at those DPI well hinted and bitmap fonts look indeed fine (and sharper) without antialiasing.
There are two reasons why 72DPI or 75DPI resolutions are common. The first is that 19 inch diagonal LCD screens are increasingly common, and they almost all have pixel dimensions of 1280x1024 (instead of 1600x1200), and that means 72DPI. Why are LCDs manufactured to such low resolutions? In part perhaps to minimize rejects, but I suspect in large part because of the second reason: many people don't know that they can set font sizes in logical, resolution independent terms, so that no matter the resolution glyphs stay the same apparent size. So for the many middle aged, computer illiterate people out there a 72DPI screen looks like it has bigger, more readable lettering, even if it is coarse.
Under both MS Windows and GNU/Linux with X it is regrettably somewhat involved to get glyphs scaled to the correct resolution. For X one has to inform the X server of the screen DPI, which can be done (a small sketch follows this list):
  • with the option -dpi dpi in the X server's command line (usually X server command lines are specified in the display manager's configuration file);
  • or by specifying the right screen area size in millimeters in the X configuration file's relevant Monitor section's DisplaySize directive.
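A small sketch of both, assuming an XFree86/X.org style server and the xdpyinfo utility; the DisplaySize numbers are just an example for a 1280x1024 panel treated as 96DPI:
# Check what the running X server currently believes:
xdpyinfo | grep -E 'dimensions|resolution'

# Either pass an explicit DPI on the server command line, usually in
# the display manager's configuration:
#   X :0 -dpi 96
# or declare the screen size in millimeters in the relevant Monitor
# section; 1280*25.4/96 by 1024*25.4/96 is roughly:
#   DisplaySize 339 271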
In addition to this, the specifiers for the requested fonts must either include the same DPI value, or a special value which means to get it from the X server. For native X fonts specified with an XLFD I have written a summary of the issues.
Under MS Windows it is not possible to specify directly the font characteristics of GUI fonts, but it is possible either to ask for larger fonts or to override the system's default DPI, which sometimes is not computed correctly, or which, even when computed correctly, one may want to override with a higher value to get larger glyphs.
060809 XML and HTML microformats
Thanks to some (not appreciative) blog entry on them I have sadly become aware of HTML microformats:
Every once in a long while, I read about an idea that is a stroke of brilliance, and I think to myself, "I wish I had thought of that, it's genius!" Microformats are just that kind of idea. You see, for a while now, people have tried to extract structured data from the unstructured Web. You hear glimmers of these when people talk about the "semantic Web," a Web in which data is separated from formatting. But for whatever reason, the semantic Web hasn't taken off, and the problem of finding structured data in an unstructured world remains. Until now.
Microformats are one small step forward toward exporting structured data on the Web. The idea is simple. Take a page that has some event information on it -- start time, end time, location, subject, Web page, and so on. Rather than put that information into the Hypertext Markup Language (HTML) of the page in any old way, add some standardized HTML tags and Cascading Style Sheet (CSS) class names. The page can still look any way you choose, but to a browser looking for one of these formatted -- or should I say, microformatted -- pieces of HTML, the difference is night and day.
Ahhhhh the pain the pain, the memories :-). In a discussion a long time ago I was not happy with the tendency to use SGML or XML to define not markup languages but data description languages:
I'll spare myself the DTDs, but consider two instances/examples of two
hypothetical SGML architectural forms; the first is called MarkDown:

  <ELEMENT TAG=html>

    <ELEMENT TAG=head>
      <ELEMENT TAG=title TEXT="A sample MarkDown document"></>
    </>

    <ELEMENT TAG=body ATTRS="bgcolor" ATTVALS="#ffffff">
      <ELEMENT TAG=h1 ATTRS="center" ATTVALS="yes" TEXT="What is MarkDown?"></>
      <ELEMENT TAG=p>
        MarkDown is a caricature of SGML; it is an imaginary
        architectural form whose semantics are document markup, where
        the <ELEMENT TAG=code TEXT="tag"></> attribute is the one to
        which the MarkDown semantics are attached.
      </>
    </>
  </>

Now this is a monstrosity, but I hope the analogy is clear, even if a
bit forced in some respects.
and here I find a variant of that monstrosity. The examples of microformats provided are also particularly repulsive because of the less than optimal choice of the HTML tags to pervert into data descriptions:
    <div class="vevent">
      <a class="url" href="http://myevent.com">
        <abbr class="dtstart" title="20060501">May 1</abbr> - 
        <abbr class="dtend" title="20060502">02, 2006</abbr>
        <span class="summary">My Conference opening</span> - at
        <span class="location">Hollywood, CA</span>
      </a>
      <div class="description">The opening days of the conference</div>
    </div>
as there is too much use of generic tags like div and span, and the abuse of abbr, where the data is in the title attribute and its verbose description is in the body of the abbr element. Now, trying to imagine the above data as text, the second example might be less objectionably marked up as:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <style><!--
      dt			{ font-weight: bold; }
      cite.title:before		{ content: open-quote; }
      cite.title:after		{ content: close-quote; }
      cite.location:before	{ content: "["; }
cite.location		{ font-style: normal; }
      cite.location:after	{ content: "]"; }
      q.abstract:before		{ content: ""; }
      q.abstract		{ display: block; font-size:90%; }
      q.abstract:after		{ content: "."; }
      --></style>
  </head>
  <body>
    <dl class="vevent">
      <dt id="20060501-20060502">
	<a href="http://WWW.MyEvent/#opening">
	  <abbr class="dtstart" title="May 1 2006">20060501</abbr> to
	  <abbr class="dtend" title="May 2 2006">20060502</abbr></a></dt>
      <dd><cite class="title">My Conference opening</cite>
	<cite class="location">Hollywood, CA</cite>:
	<q class="abstract">The opening days of the conference.</q></dd>

      <dt id="20060503-20060504">
	<a href="http://WWW.MyEvent/#closing">
	  <abbr class="dtstart" title="May 3 2006">20060503</abbr> to
	  <abbr class="dtend" title="May 4 2006">20060504</abbr></a></dt>
      <dd><cite class="title">My Conference closing</cite>
	<cite class="location">Hollywood, CA</cite>:
	<q class="abstract">The closing days of the conference.</q></dd>
    </dl>
  </body>
</html>
It might seem similar, but it isn't: my version is just text with text markup, structured as text, not data (and never mind the use of SGML and XML instances that contain only markup, with no data or text). The idea of microformats is to decorate structured data with pseudo-markup so programs can extract individual data elements more easily. As already mentioned by this blogger, this is an abuse of HTML, which is about text markup, not data markup; some XML instance would be more appropriate for the latter. However there is a decent case to be made for finer resolution markup of text so that parts of it may be easier to identify and extract, and doing this with some finer grain HTML markup is not too bad. Using the class attribute to indicate finer classes of semantics is a bit of an abuse, as it is meant to indicate finer classes of rendering, but one can make the case that different classes of rendering do relate to finer classes of meaning, at least in the eye of the average beholder.
060808 What is the resident set size, 'exmap', working sets
Someone was asking me what the RSS column in the output of ps on Linux is, and I said that in theory it is the number of resident pages the process has, but that answer is somewhat unsatisfactory.
There are two reasons for being unsatisfied with RSS, one incidental and the other deeper. The incidental one is that resident pages of shared libraries (or other shared mappings) are accounted for in every process using them, with the result that the total sum of the RSS figures is larger than the memory actually used.
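A quick way to see the effect, as a sketch assuming the usual procps ps and free utilities: the summed RSS can comfortably exceed the physical memory reported by free.
# Sum the RSS (in KiB) of every process and compare with real memory.
ps -e -o rss= | awk '{ total += $1 } END { print total " KiB total RSS" }'
free -k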
This counting problem can be partially avoided by using a tool which I recently discovered called exmap which accounts for share mapping resident pages in proportion to how many processes share them:
Exmap is a memory analysis tool which allows you to accurately determine how much physical memory and swap is used by individual processes and shared libraries on a running system. In particular, it accounts for the sharing of memory and swap between different processes.
To my knowledge, other tools can determine that some memory is shared, but can't determine how many processes are making use of that memory and so fairly apportion the cost between the processes making use of it.
Now this useful tool accounts for shared pages equally among the processes sharing them, but there might be other ways of accounting, like a more dynamic usage based count. But it already needs a kernel extension just to collect per-process ownership data for pages, because the Linux kernel does not:
Exmap uses a loadable kernel module to assign a unique id to each physical or swap page in use by each process. This information is then collated and 'effective' usage numbers calculated.
That the Linux kernel does not already keep track of this leads to the deeper issue with determining how much memory a process is actually using: Linux uses a global replacement policy for memory management, that is it treats all processes together, in other words pages from different processes compete for residency.
Global policies are popular because they are simple to implement, and they make it unnecessary to separately implement the operations of paging (of individual pages) and of swapping (of whole processes).
Local policies (like my favourite, PFF, as in W. W. Chu and H. Opderbeck Program Behavior and the Page-Fault-Frequency Replacement Algorithm IEEE Computer November 1976) instead require defining for each process a working set of the pages that are most active in that process, and they try to estimate only which pages should be part of that working set. The reason why that is better done per process than globally is that the working set of a process is supposed to change with time, but slowly, under the phase behaviour hypothesis: that process execution happens in distinct phases, and each phase usually has a different working set, which is something that a global policy does not take advantage of.
The crucial point of the working set is that its size is determined such that adding a page would hardly reduce the page fault rate, and removing a page would increase it substantially. A global policy thus will take a page from a process to give it to another process even if it is part of the working set of the original process, as it ignores process boundaries. But rather than stealing pages in the working set of another process it would be better usually to swap out entirely the other process, and steal all its pages, and then swap it in again when it gets scheduled.
Therefore what we would really like to know is not how many pages are resident per each process, but how large is the (local) working set of each process. Possibly adjusted by shared ownership of common pages, where the static measure would be enough, as pages in a working set are by construction deemed necessary.
Of course this discussion of global and local policies is pointless, because Linux kernel developers seem to be much interested only in the case where there is enough RAM that no paging or swapping occurs.
060806c Solved problem with automounted filesystems and 'updatedb'
After switching from /etc/fstab to an automounter map for most of my filesystems I have been disappointed to discover that I had missed a subtle detail, which means that the locate database created by updatedb only lists files in filesystems mounted for other reasons.
I had counted on my choice to specify the -g (ghosting) option to automount to create directories to act as virtual mount points, which updatedb would then descend into, triggering the mounting of the relevant filesystem. But currently I use Fedora on my desktop PC, and the updatedb in it is from the highly optimized mlocate variant, which checks whether a directory is empty before descending into it, and unfortunately the mere check does not trigger the mounting of the filesystem, and the mountpoint directory created by the -g option is empty before the filesystem is mounted. Indeed I just checked, and the stat64 system call does not trigger mounting, but getdents or getxattr do. Which is incorrect, because mounting should be triggered by any access to the inode or its data contents, not just the data contents (the extended attributes read by getxattr are not in the inode).
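This is easy to reproduce by hand; a small sketch, assuming a ghosted automounter mountpoint like the /fs/home entry in the map shown further down:
# stat uses stat64 and does not trigger the mount: the ghost
# directory still looks empty.
stat /fs/home
# Listing it uses getdents and does trigger the mount.
ls /fs/home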
I have then tried to configure the updatedb in mlocate to scan the relevant directories explicitly, which triggers the mount, but then the locate database gets overwritten unless I create a separate database file. But well, I can indeed do so, and simplify several issues, by creating a separate mlocate database for each filesystem in a given list. The database by default will be in the top directory of the filesystem, but optionally the list will have a second field for an explicit name (to cater for the case where the filesystem is read-only).
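For illustration, a hypothetical /etc/updatedb.dirs in that format, using mountpoints like those in the automounter map shown further down; the second field is only needed where the default location is unsuitable, for example on a read-only filesystem:
# DIRECTORY	[DATABASE FILE]
/fs/home
/fs/d
/fs/iso		/var/lib/mlocate/iso.db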
My /etc/cron.daily/mlocate.cron file now contains:
#!/bin/sh

/usr/bin/updatedb

DBDIRS='/etc/updatedb.dirs'

if test -e "$DBDIRS"
then
  grep -v '^[ \t]*#\|^[ \t]*$' "$DBDIRS" \
  | while read DIR DB
  do : ${DB:="$DIR/mlocate.db"}
    /usr/bin/updatedb -U "$DIR" -o "$DB"
  done
fi
and my profile initializes the LOCATE_PATH environment variable like this:
DBDIRS='/etc/updatedb.dirs'

if test -e "$DBDIRS"
then
  export LOCATE_PATH

  dbdirspath() {
    DBPATH="$1"; DBDIRS="$2"

    { grep -v '^[ \t]*#\|^[ \t]*$' "$DBDIRS"; echo ''; } \
    | while read DIR DB
    do
      case "$DIR" in
      '') echo "$DBPATH";;
      ?*) : ${DB:="$DIR/mlocate.db"}
	case "$DBPATH" in
	'') DBPATH="$DB";;
	?*) DBPATH="$DBPATH:$DB";;
	esac;;
      esac
    done

    unset DBPATH
    unset DIR
    unset DB
  }

  LOCATE_PATH="`dbdirspath \"$LOCATE_PATH\" \"$DBDIRS\"`"
fi
060806b The 'init' daemon and 'inittab'
Having just mentioned that I prefer to start the automounter from /etc/inittab perhaps some discussion of init is useful to justify that.
The original init was a simple thing that just ran some script chain, and the script chain would define a few nested run levels usually corresponding to modes, for example single user mode and multi user mode, where the system would pass through each lower level to reach the target level on startup, and vice versa on shutdown.
This worked in the obvious way: the first thing init would execute would be something like an /etc/singleuser script, which would contain single user initialization commands, a call to /etc/multiuser, and single user termination commands; /etc/multiuser would contain multiuser initialization code, a call to an upper level or just the spawning of getty, and multiuser termination commands.
This simple and easily comprehended structure is still present in the various BSD derivatives, and the GNU/Linux distribution Slackware uses a variant of this scheme.
For whatever reason, the UNIX System V developers decided to adopt a seemingly more general mechanism: to have run states instead of run levels (even if the terminology did not change, they are actually still called levels), in the sense that states are not ordered and the system can go from any state to any other state. To support this they added a configuration file /etc/inittab where each line is tagged with the states in which it is valid.
Another interesting feature was added: each command line can be tagged as to whether it is a simple command or whether it runs a dæmon, in which case init can monitor it and restart it if it terminates.
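A couple of illustrative /etc/inittab lines (not from any particular distribution): the second field lists the run states in which the line applies, and the third field is the action, where once and wait mark simple commands and respawn marks a dæmon that init restarts if it exits.
rc:2345:wait:/etc/rc.d/rc.multi
s1:2345:respawn:/sbin/agetty 38400 tty1 linux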
The init program reads the configuration file, and when requested to switch run state, lists all commands that are unique to the source and destination state, and the common ones, dividing the latter in simple and dæmons, and does the following:
  • Kill all the dæmons that are in the source state but not the destination state.
  • Run the once only commands for the destination state.
  • Start the dæmons unique to the destination state.
As is pretty obvious there are a few significant problems with this scheme, which the nested run level scheme does not have:
  • There is no way to specify simple commands to execute on leaving a run state.
  • Every possible run state combination must be well defined.
Also, both the run-level and the run-state models share an issue: to change the definition of a level or state one must manually edit the run-level scripts or /etc/inittab. This creates a problem if one has the goal of allowing package installation without manual intervention, because the package may include a dæmon that has to be added to the relevant levels or states.
Because of these issues something very funny happened: the original run level scheme was hacked back into the System V init, only much worse. It is based on having nested run levels by convention (and many distributions use different conventions, even if there is a nominal Linux standard), where however the nesting is not checked, because the scripts run at each level are put in a separate per-level directory. Even more insanely, which script is run at which level is not configured in a file: the mere absence or presence of a script (or a symbolic link to one) determines what is run.
Which makes keeping track of what happens a bit hard, and has resulted in a number of utilities to manage the situation. Some systems, mostly Linux based ones, have even created a dual system for determining which script to run in a given run level or state: each script can be present or absent, but if present it also checks settings in some other configuration file before actually doing anything else.
Linux based distributions have generally followed System V, so they have adopted its init model, with variants: for example to enable/disable individual services RedHat uses files in /etc/sysconfig, SUSE first used the file /etc/rc.config which was then split into separate files under /etc/rc.config.d, and Debian goes for gold by having only two levels, where which scripts run in the upper one is determined solely by configuration in various random files.
A sorry situation, and what astonishes me is that so far very few people have realized that the correct solution is either to improve the run level scheme a bit, or just to fix the System V init, for example to:
  • Add the ability to specify commands to be executed on exit from a run state, not just on entry (as in once, wait).
  • Add the ability to include all the files in a directory.
  • Add some way to tell dæmons that a state change is occurring, to avoid stopping them on leaving a state only to restart them on entering the next.
Ideally one would also add some form of command/dæmon dependency management, to avoid the issue of the order in which to execute them within a run state.
There are however already several alternative init redesigns. There are very small chances that any of them will be used in a production distribution, as the inertia of historical accidents will probably prove too strong. Still, it may be interesting to have a look at some, for example LFSinit, boot-scripts, and the NetBSD 1.5 rcorder system.
060806 An amusing example of (moderately) bad code
While revising some kernel patches (to figure out why writing to DVD[-+]RW and DVD-RAM started working in 2.6.17) I found this amusing bit of bad code:
      for (unit = 0; unit < MAX_DRIVES; ++unit) {
              ide_drive_t *drive = &hwif->drives[unit];
              if (hwif->no_io_32bit)
                      drive->no_io_32bit = 1;
              else
                      drive->no_io_32bit = drive->id->dword_io ? 1 : 0;
      }
The perhaps inadvertent double obfuscation of the logical condition is all too typical. I would have written:
      {
	      const ide_drive_t *const end = &hwif->drives[MAX_DRIVES];
	      ide_drive_t *drive;
	      for (drive = &hwif->drives[0]; drive < end; drive++)
		      drive->no_io_32bit = hwif->no_io_32bit || drive->id->dword_io;
      }
I also wonder whether scanning all drives up to MAX_DRIVES is correct, without checking whether the drive is actually present. But probably harmless. As a final note, the code above is not, by far, the worst I have seen recently; it is just obviously lame.
060805 Using automounter maps instead of '/etc/fstab'
After some consideration I have decided to switch to Linux automounter maps to mount most of my local filesystems, instead of relying on /etc/fstab and the boot scripts. The two are mostly equivalent, but the automounter maps are used by a dæmon to mount filesystems dynamically, on usage, instead of statically; mounted filesystems are then unmounted after they haven't been accessed for a while.
The main advantage for me of automounter maps is not that by mounting automatically they remove the need to issue a mount command before accessing a filesystem, which is a trifling issue, but that they do so dynamically, that is filesystems only stay mounted for as long as they are used. This has the not inconsiderable advantage that most filesystems will stay unmounted most of the time, and an unmounted filesystem is clean and does not need to be checked if there is a crash, which is particularly useful if one does system development and crashes occur during debugging. Sure, most file system types have journaling and recover fairly quickly, but because my PC dual boots MS Windows, I also have (for historical reasons) a few FAT32 and ext2 filesystems that don't have journaling. Moreover keeping filesystems inactive and unmounted reduces other chances of accidental damage.
Currently I have only one map for mounting my local filesystems under /fs, called /etc/auto.fs; it has two sections, one for mounting filesystems (mostly on removable media) that might exist on any system, and another for mounting PC specific filesystems, and it looks like this:
# vim:nowrap:ts=8:sw=8:noet:ft=conf
#MOUNTP -fstype=TYPE[,OPTION]*                                  :RESOURCE

# Host independent
##################

0       -fstype=ext2,user,rw,defaults,noatime,nosuid,nodev      :/dev/fd0
a       -fstype=vfat,user,rw,nocase,showexec,noatime,umask=077  :/dev/fd0
A       -fstype=msdos,user,rw,noatime,umask=077                 :/dev/fd0

1       -fstype=ext2,user,rw,defaults,noatime,nosuid,nodev      :/dev/fd1
b       -fstype=vfat,user,rw,nocase,showexec,noatime,umask=077  :/dev/fd1
B       -fstype=msdos,user,rw,noatime,umask=077                 :/dev/fd1

sda1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sda1
sdb1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdb1
sdc1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdc1
sdd1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdd1
sde1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sde1
sdf1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/sdf1

uba1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/uba1
ubb1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubb1
ubc1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubc1
ubd1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubd1
ube1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ube1
ubf1    -fstype=vfat,user,ro,nocase,showexec,noatime,umask=077  :/dev/ubf1

pkt     -fstype=udf,user,rw                                     :/dev/pktcdvd/0
udf     -fstype=udf,user,ro                                     :/dev/cdrom
cd      -fstype=cdfs,user,ro,mode=0775,exec                     :/dev/cdrom
iso     -fstype=iso9660,user,ro,mode=0775,exec                  :/dev/cdrom
r       -fstype=iso9660,user,ro,mode=0774,norock                :/dev/cdrom

pkt1    -fstype=udf,user,rw                                     :/dev/pktcdvd/1
udf1    -fstype=udf,user,ro                                     :/dev/cdrom1
cd1     -fstype=cdfs,user,ro,mode=0775,exec                     :/dev/cdrom1
iso1    -fstype=iso9660,user,ro,mode=0775,exec                  :/dev/cdrom1
s       -fstype=iso9660,user,ro,mode=0774,norock                :/dev/cdrom1

# Host dependent
################

home    -fstype=jfs,defaults,noatime                            :/dev/hda2

c       -fstype=vfat,rw,nocase,showexec,noatime,umask=02        :/dev/hda3
d       -fstype=ext3,rw,defaults,noatime                        :/dev/hda6
e       -fstype=ext3,rw,defaults,noatime                        :/dev/hda7
Interestingly autofs maps can be scripts that, given the mountpoint key, print the options and the resource to mount on it, but for my simple purpose of /etc/fstab replacement that is not necessary: a static map is good enough. I have of course kept an /etc/fstab, but it now contains only the indispensable static mounts, which are for virtual filesystems and swap:
# vim:nowrap:ts=8:sw=8:noet:
#DEVICE         MOUNT           TYPE    OPTIONS                   DUMP PASS

# Host independent
##################

none            /proc           proc    auto,defaults                   0 0
none            /sys            sysfs   auto,defaults                   0 0
none            /dev/pts        devpts  auto,mode=0620,gid=tty          0 0
none            /proc/bus/usb   usbfs   auto,defaults                   0 0

none            /proc/nfs/nfsd          nfsd    noauto                  0 0
none            /var/lib/nfs/rpc_pipefs pipefs  noauto                  0 0

# Host dependent
################

/dev/hda1	/		jfs	defaults,errors=remount-ro	4 1
none            /tmp            tmpfs   auto,mode=0777,size=700m,exec   0 0
none            /dev/shm        tmpfs   auto,mode=0777,size=100m,exec   0 0

/dev/loop7      swap            swap    noauto,swap,pri=8               0 0
Instead of using the autofs rc script to start the automount dæmon, I have chosen to start it directly in /etc/inittab, because that is where it should be, and thus I have added these lines to it:
# Use 'automount' instead of '/etc/fstab'. Local and remote maps.
af:2345:wait:/usr/sbin/automount -g -t 60 /fs file /etc/auto.fs
aa:345:wait:/usr/sbin/automount -g -t 180 /am file /etc/auto.am
Note that I have specified the ghosting option -g to make the mountpoints visible under the main mountpoint, and slightly different and somewhat longish autounmounting timeouts. One can always issue umount explicitly if one wants a quick unmount, for example for a removable medium.
One possible downside to this is that startup becomes a little more fragile, but not really, because the essential filesystems (root and possibly /usr and /var if separate) should in any case be specified in /etc/fstab and be mounted statically at boot. Surely nowhere near as fragile as udev. Another possible downside is that mounting is slow for some file system types, most notably ReiserFS, because some extended consistency checks are performed. In such a case perhaps lengthening the default timeout is a palliative.
Instead of using autofs I might have used AMD, a similar dæmon which is system independent (instead of using the special autofs module of Linux it pretends to be an NFS server) and rather more sophisticated. But AMD is more suitable for large networked installations, where it has considerably simplified my work a few times; here I just wanted a dynamic extension to /etc/fstab, and for that autofs is quite decent.
autofs is in effect a subset of the ability in Plan 9 to mount processes in the filesystem name space, and vaguely similar to the BSD portal feature, for example as in mount_portalfs.

July 2006

060729 Volumes and filesystem tags
Hardware configurations are getting larger and more complicated in many fields, and this creates the not slight difficulty of how to identify devices and their contents. The traditional way is to identify devices by location, and not to identify their contents at all, but this is becoming rather less convenient as locations can change (especially because of PnP and autoconfiguration).
Therefore some OS bits can identify peripherals and their contents by some kind of tag, and in particular this applies to storage devices. On an x86 style PC most storage devices are partitioned using PC-DOS style partition tables, which are contained mostly in PC-DOS style volume description blocks called MBRs because they contain boot code too. Then each partition may contain a filesystem tag (name or id or both) in its superblock.
I only recently started to pay attention to the identification data in the MBR and the superblocks because of LILO, the Linux-oriented boot loader. Recent versions of LILO identify the boot filesystem by tag, not by location, which has caused me some complications, because my backup scheme is to do verbatim partition copies from one disk to another, and this of course copies the superblocks too, and the lilo compiler does scan all partitions to verify that they have distinct tags. Fortunately it is fairly easy to ask lilo to ignore specific drives, and I did so for my backup drives.
But then I decided to investigate tags a bit more, as they also matter to MS Windows, which I occasionally boot too, and to other pieces of software. As usual the situation is messy and a bit warped by historical issues. To summarize the simplest parts of it, tags can be:
  • The MBR volume id, which is the 4 bytes at offset 0x1b8. There are some scattered utilities to set it, but it can be set and reset to a random value with some special options to a recent version of lilo, or perhaps best examined and changed with a binary editor like hexedit.
  • The filesystem name, which most Linux file systems support, and is usually a string. Most Linux file systems have a utility that can set it at format or any other time.
  • The filesystem UUID, a largish number that most Linux file systems also support, and which is as a rule generated semi-randomly.
  • The UUID or name of Linux md RAID arrays or member volumes, or UUIDs for LVM2 physical volumes, volume groups and logical volumes.
In theory only the MBR volume id is really necessary, because the location of a partition on a disc is a simple number that as a rule does not change, but it is the filesystem id that is the most often used tag.
Except in two important cases: the MBR volume id is used by LILO, and it is used by MS Windows NT and later to verify the boot partition. The MS Windows boot partition contains the registry, and in the HKEY_LOCAL_MACHINE\SYSTEM\MountedDevices\ and HKEY_LOCAL_MACHINE\SYSTEM\DISK\ subtrees it is used to map partitions to drive letters. In particular it is used to identify the boot partition, and a mismatch between the volume id recorded in the registry for the boot partition and that on the drive will prevent MS Windows from starting. This can be fixed most easily by zeroing the volume id in the MBR, a special case which MS Windows handles by generating a new one and fixing the registry with that value.
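As a sketch of that last point, assuming the disc is /dev/hda (be very careful with the second command, it writes to the MBR): the volume id is the 4 bytes at offset 0x1b8, that is 440.
# Show the current MBR volume id.
dd if=/dev/hda bs=1 skip=440 count=4 2>/dev/null | od -An -tx1
# Zero it, so that MS Windows generates a fresh one and fixes the
# registry with it at the next boot.
dd if=/dev/zero of=/dev/hda bs=1 seek=440 count=4 conv=notrunc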
060728b Using humans as information system components
I have just read a rather impractical discussion of using human rather than artificial intelligence to do subtle but repetitive tasks like image recognition. The fundamental idea is that humans are cheaper than machines, and therefore one can build cheap services with a web interface and a human backend:
The jobs all consist of some simple task that can be performed at your computer -- such as viewing pictures of shoes and tagging them based on what color they are. You get a few pennies per job
Computers suck at many tasks that are super-easy for humans. Any idiot can look at picture and instantly recognize that it's a picture of a pink shoe. Any idiot can listen to a .wav file and realize it's the sound of a dog baring. But computer scientists have spent billions trying to train software to do this, and they've utterly failed. So if you're company with a big database of pictures that need classifying, why spent tens of thousands on image-recognition software that sucks? Why not just spend a couple grand -- if that -- getting bored cubicle-dwellers and Bangalore teenagers to do the work for you, at 3 cents a picture?
Well, true, but then the obvious way to provide this service is for one of the any idiots to open 100 service provider accounts and replace himself with a little program that just clicks a reply at random. There are already many problems with Google link clicking abuses.
Because the problem is, as one comment says, quality control, and quality control is expensive. Perhaps the best approach would be to give the same task to say 9 different people and then take as valid only those answers where at least 6 match. But this increases the cost of the system.
The overall concept is based on my observation that a computer system always includes some humans as peripherals, and that functionality can be split between the machine and human components. For example when machine components were expensive, humans would run a compiler in their brains, while now that the cost ratio has changed the compiler runs on the machine CPU. Similarly in a small system the spooler can be just two users telling each other who is next to print, and in a larger system it can be a dæmon performing the same function.
Indeed humans can be used as pattern matching coprocessors as in the above, and data entry operators have been used as OCR peripherals for decades. But then one must consider carefully the structure of incentives, because while most machines for now obey orders, many humans obey incentives, a point of great importance in ensuring that computer using projects deliver benefits.
060728 A bitter irony in the design of 'udev'
My contempt for the misdesigned and poorly implemented udev subsystem of GNU/Linux has reached new depths, as I realized that one of its (usually unmentioned) components, the sysfs file system, is actually just shy of being a devfs clone that would make udev largely irrelevant.
The key absurdity is that for every device recognized by a driver there is an entry (among many) in /sys containing the major and minor numbers of the device special file that should be created for that device, for example:
$ ls -ld /sys/class/input/mice/dev /dev/input/mice
crw------- 1 root root 13, 63 Dec  4  2005 /dev/input/mice
-r--r--r-- 1 root root   4096 Jul 28 14:56 /sys/class/input/mice/dev
$ cat /sys/class/input/mice/dev
13:63
What? Why not the device special file itself? It would even be easier.
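Indeed all that udev fundamentally has to do per device is something along these lines (a deliberately simplified sketch for the mouse example above):
# Read the major:minor pair that sysfs already publishes and create
# the corresponding device special file by hand.
IFS=: read MAJOR MINOR < /sys/class/input/mice/dev
mkdir -p /dev/input
mknod /dev/input/mice c "$MAJOR" "$MINOR"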
But then the game would be too transparent: because then sysfs would be just an overcomplicated clone of devfs, and udev would be essentially irrelevant.
Another bitter irony is that one of the alleged advantages of udev was that udev is small (49Kb binary), which sounds to me rather disingenuous to say the least:
# size /sbin/udev{d,d.static}
   text    data     bss     dec     hex filename
  63821     940   10496   75257   125f9 /sbin/udevd
 741984    2236   51732  795952   c2530 /sbin/udevd.static
and that is only the dæmon, never mind the cost of sysfs itself.
I am not surprised that the author of devfs chose not to defend it against prejudiced attacks and just let it drop. In free software no less than in proprietary software it often happens that self interested territorialism enhances job security, and the most obnoxious and self serving attacks trump merit easily.
One of those attacks is the probably knowing prevarication that devfs forces the devfs naming policy into the kernel: the naming scheme forced into the kernel is the Linus Torvalds scheme, a scheme that the author of devfs actually disliked and was told to implement instead of a more sensible one:
Devfsd provides a naming scheme which is a convenient abbreviation of the kernel-supplied namespace. In some cases, the kernel-supplied naming scheme is quite convenient, so devfsd does not provide another naming scheme. The convenience names that devfsd creates are in fact the same names as the original devfs kernel patch created (before Linus mandated the Big Name Change). These are referred to as "new compatibility entries".
Another probably knowing prevarication by the author of udev is the claim that If you don't like this naming scheme, tough: in the very FAQ for devfs alternative naming schemes are described, implemented by the devfs customization dæmon, devfsd:
Kernel Naming Scheme
Devfsd Naming Scheme
Old Compatibility Names
and indeed just about any naming scheme can be implemented by devfsd.
In the devfs and udev discussion the main and probably knowing prevarication is the comparison of udev with devfs rather than devfsd, and the ignoring of the essential role of sysfs in the udev story.
In effect the sysfs and udev pair is just a complicated overextension of the devfs and devfsd pair, one that does not have some of the more useful features and is more fragile: while devfs can be used without devfsd, udev cannot really be used without sysfs (because of coldplugging), and sysfs cannot be used without udev because it stops just one step short of creating device special files.
Apart from being an overcomplicated subset of devfs and devfsd, the main difference with sysfs and udev is that the latter pair is the kernel land grab of GKH, and thus benefits his standing and his job security instead of those of RG, and this may be regarded by some as a good motivation for disingenuous attacks.
060727b Modern games are rather CPU bound
Reading the August 2006 issue of PCFormat I was amused to see a CPU showdown review in which various very recent CPUs are benchmarked within a top end system with a top end graphics card (1GiB RAM, NVIDIA 7900GTX with a price of around £280 before tax), and the frame rate depends heavily on the speed of the CPU (note that only one processor is really used by the games, even if the chip has two):
CPU benchmark in two recent games
(performance in average Frames Per Second)

CPU model           Clock     L2 cache  Price       Half Life 2:  Far Cry
                                        before tax  Lost Coast
Pentium D 805       2x2.7GHz  2x1MiB    £58         52            54
Sempron 3600+       2.0GHz    256KiB    £58         67            55
Pentium 4 560       3.6GHz    1MiB      £110        70            63
Athlon 64 3800+     2.4GHz    512KiB    £61         83            70
Athlon 64 X2 4000+  2x2.0GHz  2x1MiB    £105        91            71
Athlon 64 X2 4400+  2x2.2GHz  2x1MiB    £219        95            75
Athlon 64 X2 4800+  2x2.4GHz  2x1MiB    £345        108           83
Core 2 Duo E6300    2x1.9GHz  2MiB      £115        104           86
Core 2 Duo E6400    2x2.1GHz  2MiB      £128        111           89
Core 2 Duo E6700    2x2.7GHz  4MiB      £310        149           107

Compare with benchmark tests of the NVIDIA 7900GTX (and the similar ATi X1900) and other cards with an Athlon 64 FX-60 CPU (2x2.60GHz, 2x1MiB L2 cache), which is a bit faster than the Athlon 64 X2 4800+ in the table above, on the same two games, Half Life 2: Lost Coast and Far Cry, at various resolutions and graphics options. It looks like the CPU test from which the table above was derived was done at 1024x768 in top quality mode.
My overall impression is that both games are rather CPU bound, especially at 1024x768, and probably they become graphics bound only at resolutions higher than 1280x1024. At the lower resolution FPS increases by a factor of three from the slowest to the fastest CPU.
Now if this happened at very high FPS it would mean that the graphics chip was still the limiting factor, but the range starts at 52FPS. With a less powerful graphics card the games would become graphics bound sooner, but then the quality settings are very high, and with a slower graphics card one would not set them so high.
060727 Another user discovers that 512MiB is too little
A while ago I sold out and got another 512MiB of RAM, and now I see that another user has done the same, for the same reason: the working set of a Linux desktop environment exceeding 512MiB, mostly but not only because of web browsing.
060726 Another symptom of the Microsoft cultural hegemony
Another symptom of the increase of the Microsoft cultural hegemony and the continuing fading of the UNIX culture is that many people are using options in Linux commands after the arguments, not before, for example:
mount /dev/scd0 /media/dvdram/ -t udf -o rw,noatime
which is actually dangerous, because many if not most commands scan their argument list strictly left-to-right, and options apply only to the arguments that follow them. This is a simple and obvious concept, so how come these people put the options afterwards? Because that's what they are used to doing in Microsoft's cmd.exe; try to imagine the previous line as:
mount \dev\scd0 \media\dvdram /t:udf /o:rw,noatime
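The danger is easy to demonstrate; a small sketch with GNU grep, which normally reorders its options but stops doing so when POSIXLY_CORRECT is set, behaving like a strict left-to-right scanner:
# Option before the arguments: honoured as an option.
POSIXLY_CORRECT=1 grep -i pattern somefile
# Option after the arguments: taken as a file named '-i'.
POSIXLY_CORRECT=1 grep pattern somefile -i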
060725b Interview with the CEO of NVIDIA
In the wake of the AMD/ATi merger, I have noticed a very informative very recent interview with NVIDIA's CEO, a person that I have often admired both for business and common sense. Of the several things he says about his life and technology this one seems to me quite impressive:
We were confronted with that question with GeForce 256 in 1999. That was on par with the Reality Engine supercomputer that Silicon Graphics used to sell. The fill rate, the performance, the throughput of the chips was amazing. In 1999, we were confronted with that question. I thought we made one of the most important decisions for not only our company but the entire industry. That was going to programmability. As you know, investing in programmability, which ultimately ended up in GeForce FX, made the graphics chip far more programmable. But the cost was so high. It was quite a burden on our company for a couple of years. I would calculate that the amount of profitability lost on GeForce FX was $200 million a year for two years. It was almost $500 million in lost profit. That was a half a billion dollar risk we took going to a completely programmable model.
Where are we betting? We are betting a lot on the cell phone, believing that multimedia will be an important part of these devices. With GPUs, we are betting a huge amount on programmability. GPUs are going to become more programmable, easier to write programs, and support much more complex programs.
as it also has implications for physics chips. Another interesting statement is about the amount of money NVIDIA are spending in developing graphics chips:
We invest $750 million a year in R&D in graphics processing. No other company invests that much in graphics processing today. This is such an incredibly deep technology and there is so much more to do. It makes sense that in the long-term we would work on game consoles as well. The others can't keep up with the R&D that we do. That part makes perfect sense to me.
both because $750m/y is a lot on its own, and because of how it compares with the amount of money Microsoft paid ATi to develop the graphics chip for the Xbox 360, about which there is some slightly contradictory news:
Microsoft began the Xbox 360 project in late 2002.
In 2003, Microsoft decided it wanted to launch its next-generation console before Sony.
IBM put 400 engineers to develop the Xbox 360 CPU.
IBM taped out (finished the design) of the CPU in December 2004.
ATI put 300 engineers to developer the Xbox 360 GPU, aka Xenos.
ATI finished its GPU design on November 2004.
ATI had 175 engineers working on the Xenos GPU at the peak. These included architects, logic designers, physical design engineers, verification engineers, manufacturing test engineers, qualification engineers and management. The team was spread out between Orlando, Florida, Marlborough, Massachusetts and Toronto, Ontario. The team's size varied during the life of the program -- in fact, we still have about 20 engineers involved in various cost down projects.
Looks like 2 years at perhaps 100 engineers on average, that is around 200 person-years at $100k each, probably around $20-40m contract value, which matches what Sony paid for NVIDIA's equivalent chip for the PS3:
Sony paid Nvidia about $30 million for designing the PS3's graphics chip, a fee that was paid out gradually as work on the processor was completed, Nvidia's chief financial officer Marv Burkett said.
which however does not include the royalties on the chip, which can be a lot more. I remember reading that ATi got paid by Microsoft overall around $350m for the Xenos GPU IP.
060725 ATi and AMD merge
So it is now official that ATi and AMD are merging. Even if the deal is structured as a purchase, as AMD is paying out quite a bit of cash, it is more like a merger: ATi shareholders get a very important slice of AMD's stock, there is clearly little overlap between the two businesses, and the price paid (a fairly low premium) and the friendly nature of the deal mean that ATi management feel they will matter in the new company.
My reaction is that it is a rather odd merger, because it is not clear what AMD will be getting out of it. To me it seems that AMD's primary motivation is to add graphics to their product line, because after all with rather weak technology Intel have 40% (probably by value, by volume it is far higher) of the graphics market and probably AMD felt they had to counter that.
But that seems a bit odd to me, as Intel's graphics chip business is really a subset of their chipset business, and AMD have been usually reluctant to do chipsets. They had done very good chipsets for their previous generations, like the 76x series for the Athlons, but always with just the intention of seeding the market, and then they have let chipset specialists like VIA supply the market and gradually withdrawn their own presence. Ironically, the NVIDIA nForce chipset series was mostly designed by AMD for the Xbox, and was licensed to NVIDIA when AMD lost the Xbox CPU contract to Intel at the last minute. Arguably with the Athlon 64 they are now still in the chipset business, as the north bridge, the bus and memory controllers, are now included on the CPU itself, but that is stretching things a bit.
Now it feels like AMD want to offer graphics chips, including the high end ones that Intel does not. Perhaps they now have sufficient fabrication capacity, or perhaps they want to put graphics integrated on the CPU chip itself:
Well the point there really is that we now have the chipset, we have CPUs, we have GPUs, and we can mix and match the appropriate level to each of those things to different customer requirements.
as they have done already so with the north bridge chip; after all all those ever increasing transistor budgets given by process improvements have to be used up somehow, but I had assumed that on die cache and extra CPUs would easily absorb them.
A graphics core on the same die as a CPU core could give remarkable bandwidth and performance improvements, but at the same time it makes the whole dies obsolete quicker, as the graphics generation interval is a lot shorter than the CPU generation interval.
There is also a large downside to the deal: NVIDIA, whose graphics cards are sold along with very many AMD CPUs, will now see AMD as a competitor and not a partner. Intel also competes with NVIDIA (and ATi) with its low end graphics chips, and competes with NVIDIA in its motherboard chipset range too, but it does not compete with NVIDIA in high end graphics, which is far more critical to NVIDIA; AMD/ATi of course will. AMD/ATi think that they can cooperate and compete with both Intel and NVIDIA:
The whole company, well the new company wants to give as much choice to its customers as possible, so if you want an Intel CPU with an ATI GPU then great, we'll be happy with that. If you want an AMD CPU with an NVIDIA GPU then we'll be very happy with that as well.
The key may be that both AMD and ATi see the future as being not high end chips but low power chips for handhelds and laptops:
Yeah, we want to be able to package really high performance, low power CPU with a high performance low power chipset and a really high performance low power GPU for mobile, that's really ATI's hallmark.
But to me ATi's hallmark is very high performance chips (but then even ATi's top end chips are way less power hungry than NVIDIA's).
Overall I suspect that VIA would have been a much better fit, also because VIA competes vigorously with Intel too, and would have benefited from shielding behind AMD's patents and cross licensing agreements.
060724d Mark Rein on games on laptops and Intel graphics chipsets
Mark Rein is a rich guy from a very rich company, and he can afford to speak out on game industry issues; one of those he has addressed recently is serious and interesting: his contention is that underpowered Intel graphics chips:
Over several slides on the topic, Mark laid out the reasons he thinks that PC gaming is being harmed by Intel. He pointed the finger at Intel's integrated graphics chips. Integrated chipsets are often incapable of playing the latest (and certainly next-generation) games at any kind of graphics settings. Despite this, they are wildly popular amongst retailers. According to Mark's figures, 80% of laptops and 55% of desktops (note: he failed to cite a source for these figures) feature integrated graphics.
are undermining the PC games industry. My summary of the story with Intel graphics chips is:
  • They are far slower and have much less memory than the typical midrange or even budget PC gaming graphics card.
  • They are integrated into motherboard chipsets which end up in an overwhelming majority of laptops sold and a large number of budget desktops too.
  • Laptop sales by now are larger than desktop sales (surely in money terms, and probably in unit terms too).
  • Since recent consoles have graphics chips equivalent to midrange or even premium PC graphics cards, most laptops and many desktops are at a huge disadvantage.
  • The PC game industry develops games that are meant to be released on PCs with recent graphics cards and on consoles too, and that are thus nearly unsellable on most laptops and many desktops.
Note however an interesting contradiction in the argument:
they are wildly popular amongst retailers. According to Mark's figures, 80% of laptops and 55% of desktops
In other words it looks like Mark Rein should not really be complaining that Intel is responding to a huge demand from paying customers for low end graphics chips (as I often say, the [paying] customer is always right), but rather about the unwillingness of some game developers to address a very large market segment.
Nor do I think it is a coincidence that while traditional PC based game sales are falling, those of PC based persistent online games are exploding, as these rely more on gameplay than on graphics advances for their popularity, and their PC based clients are usually deliberately designed to run decently even on low end graphics chips like the Intel ones.
Mark Rein's contention that it would take only about US$10 to improve Intel graphics chipsets sufficiently is based on a grave mistake: that may be the extra cost to the manufacturer of the chipset, but as a rule the retail price ends up being a multiple of the cost of parts, so if Intel ended up charging something extra, the price of US$300 PCs would go up noticeably.
The other thing that this discussion tells me is that while graphics quality is somewhat scalable, in that most PC games have controls for trading off speed against visual sophistication, it is not that scalable. The obvious suspicion is that while decorations like texture quality, lighting and special effects are easy to scale, geometry is much harder to scale, and the real problem with many PC games is high triangle counts, which the Intel graphics chips cannot handle.
As to triangle counts, the games and game engines sold by Mark Rein's company have been designed since UT2003 on the assumption that graphics chips can handle large polygon counts entirely onboard:
Unreal II was released in Feb/2003. The polycounts are in general, very high. In the opening level, polycounts ranged from 20K (thousand) polys to 85K at parts. The outdoor sections had more polys on average, about 45-60K. Indoors, in the training area, the average was 35-38K. There were no enemies in these levels, which allows for more level detail.
The ship that serves as your main base of operations has slightly more detail. Since the space is smaller, it's possible to cram more polys into each area. The count ranged from 40-100K, averaging about 60K.
In the first level with enemies and core gameplay, the count ranged from 15-75K. While in combat, the average was about 25-30K polygons, including enemies. Obviously, when designing heavy combat areas, your level of geometry density will have to fall. In the third level, portions of the level (exterior, beginning) reached as high as 150-160K polys drawn onscreen at once. For most of the core gameplay, however, the count was between 50-80K polygons.
and even more so for the upcoming Unreal 3 engine:
Over 100 million triangles of source content contribute to the normal maps which light this outdoor scene.
Wireframe reveals memory-efficient content comprising under 500,000 triangles.
Only 500,000 instead of 100 million, quite an interesting point of view.
060724c Getting software into the Linux kernel
Programmers are often territorial, and one of the more valuable pieces of territory is the base Linux kernel: owning a sizable chunk of it is usually well rewarded, as it gives significant benefits to anyone with a vested interest in the wide availability of that code.
I have remarked earlier that such territory is guarded by gatekeepers, who have granted a new homestead on it to the ext4 file system even though it is still under development. By contrast, the code for the Reiser4 filesystem has instead been the subject of much more discussion and has been waiting for two years, a contrast that Hans Reiser himself has noticed.
However the XFS story is not such a good example of arbitrarily delayed inclusion in the main-line kernel territory as there were genuinely large technical problems precisely due to it relying on a non-Linux abstraction layer. Compare JFS, which was also a port from another kernel, and was accepted more quickly also because it more rapidly went native.
But there is an even more amazing case of speedy satisfaction of a desire by someone for a chunk of GNU/Linux territory: udev related facilities were added even if they were entirely unproven, even if they were evolving rapidly and were not stable, and even if the devfs alternative was already in the kernel and widely used. A clear case for keeping udev off the main line of development, both kernel-space and user-space, but it has not worked out quite like that.
060724b Partitions, extended partitions and 'ms-sys -p' on NTFS
I have noticed, while looking at the MBR on one of my drives, that some MS Windows style tools (one of them being Partition Magic) could not quite list all the partitions on some of my hard drives correctly, in particular logical partitions inside extended partitions. Logical partitions have a particularly dangerous definition in the IBM PC partitioning scheme, because they are defined by a chain of partition descriptors, where each logical partition contains a descriptor that defines its start and length, and thus the beginning of the next. The descriptor is contained in the logical partition's boot block (block 0), and is augmented under MSDOS 6 partitioning by some extra information in a subsequent block if the partition contains a FAT32 filesystem.
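As a sketch of how one can look at this (the device names here are just examples, not necessarily the drives involved), the whole chain can be listed and a logical partition's boot block dumped with standard tools:

# List primary, extended and logical partitions as recorded in the
# descriptor chain ('sfdisk -l' prints the whole table).
sfdisk -l /dev/hda
# Dump the boot block (sector 0) of a logical partition to inspect
# the descriptor fields by eye.
dd if=/dev/hda5 bs=512 count=1 2>/dev/null | hexdump -C | head -n 32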
Suspecting therefore that I had to fix the boot blocks of the logical partitions, I liberally applied the extraordinarily useful ms-sys utility to most partitions, as ms-sys -p, to rebuild the partition information in each boot block. Then I discovered that NTFS is among the very few filesystems that put vital information of their own in sector 0 of the filesystem. Fortunately I had a backup image of that partition, and restoring the boot block from it fixed the issue. Anyhow ms-sys -p seemed to fix whatever issues those MS Windows based programs had with the partition descriptors.
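With hindsight, a minimal sketch of the safer sequence (device and file names are just examples):

# Save the boot block first, in case the filesystem (NTFS above all)
# keeps vital data of its own in sector 0.
dd if=/dev/hda6 of=hda6-bootblock.bak bs=512 count=1
# Rewrite the partition related fields of the boot block.
ms-sys -p /dev/hda6
# If that turns out to be a mistake, put the saved copy back:
# dd if=hda6-bootblock.bak of=/dev/hda6 bs=512 count=1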
060724 Very odd extra 16 bytes
Among unexplained mysteries, something really strange happened to me recently: one of my hard disc drives suddenly appeared to have lost all partitioning information. I studied the MBR and the partition table and realized that there were 16 seemingly random bytes at the beginning of the MBR. So I decided to zero it out and copy it back from a backup, and the same 16 bytes appeared again as the first 16 bytes of the zeroed block. Then I realized that those 16 bytes were not somehow overwriting the first 16 bytes of the first sector of the disc, but were being prepended to it, and to the whole disc, which would have made it unusable. So it could not really be that the first sector of the disc had become damaged, and the problem was probably some failure in the electronics of the platter-to-memory path. Hoping it was a soft fault, perhaps caused by heat or static, or just some bitrot, I fully powered down the PC and after some time powered it up, and those prepended 16 bytes disappeared, and have not reappeared since. Very odd.
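For the record, the sort of commands involved, as a sketch (the device and backup file names are just examples):

# Look at the first sector of the disc.
dd if=/dev/hdb bs=512 count=1 2>/dev/null | hexdump -C | head
# Zero the MBR and then write back a previously saved copy of it.
dd if=/dev/zero of=/dev/hdb bs=512 count=1
dd if=mbr-hdb.bak of=/dev/hdb bs=512 count=1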
060723 Now IO has priorities too under Linux
Well, I have an annoying backlog of entries to write (I have been somewhat busier than usual), but this one is quick and useful: someone I know has spotted that the cfq elevator for Linux now has time slices and priorities for IO, and that the priorities can be set with the ionice command. This makes the cfq elevator even more useful for preserving good interactive response at little cost.
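A quick sketch of how these can be used (the device name, archive path and PID are just examples, and the per-disc scheduler file assumes a kernel with sysfs):

# Check which elevators are available and select 'cfq' for one disc.
cat /sys/block/hda/queue/scheduler
echo cfq > /sys/block/hda/queue/scheduler
# Run a bulk job in the 'idle' IO class, or lower the IO priority
# (best effort class 2, levels 0-7) of an already running process.
ionice -c3 tar cf /var/tmp/home.tar /home
ionice -c2 -n7 -p 12345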
Even without ionice I have been able to play Linux based big games like UT2004 and Doom 3 fairly well thanks to my usual modified kernel settings and even with some bulk uploads (sharing my collection of GNU/Linux liveCD and installer ISO images) and downloads going on, thanks to my traffic shaping script. Play is less snappy but quite reasonable, and the main effect is that the copy rate goes down from 50MiB/s to something like 40MiB/s.
It gets however a bit painful when I stop playing, at least with UT2004, because it grabs a lot of memory, and then several hundred MiB of my non game applications get swapped out. After all I have only 1GiB and UT2004 grows to 500MiB... The processes I have active include some instances of Konqueror between 90MiB and 180MiB, a 100MiB Kate, a 170MiB Firefox, and one of the smallest is XEmacs at 33MiB. I think some of these sizes are amazing. For comparison Tribes 2 only takes 150MiB, which is smaller than Firefox or a heavily used instance of Konqueror.
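These figures come from the usual tools; with a recent procps something like this lists the largest resident sets:

# Resident and virtual sizes (in KiB) of the ten largest processes.
ps -eo rss,vsz,comm --sort=-rss | head -n 11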
060713 Switched to SMP and voluntary preemption
Because of my attention to the social definition of works I usually try to use software and configuration options that I think most developers and users adopt, because the others don't get tested. Now that most Linux kernel developers are well paid and use top range PCs, I have switched to using Linux kernels compiled for multiple CPUs and with preemption. The reason here is that Intel Core Duo and AMD Athlon 64 X2 based PCs are becoming ever more popular among those who can afford them, and in particular those who do distributions tend to have them because they make building binaries a fair bit faster.
So far so good: I am using an SMP Linux kernel even if on a uniprocessor, and with preemption, and contrary to past experience it seems reliable (so far) even if I use the proprietary NVIDIA driver to play games.
As to preemption, it used to be a somewhat risky option, but since it is more or less part of SMP anyhow, I brought myself to enable it, even if only the voluntary variant. It seems to have become more reliable.
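For reference, the kernel configuration options involved, as a sketch (2.6 series option names):

# In the kernel's .config:
#   CONFIG_SMP=y
#   CONFIG_PREEMPT_VOLUNTARY=y
# To check what the running kernel was built with:
grep -E 'CONFIG_SMP|CONFIG_PREEMPT' /boot/config-"$(uname -r)"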
Altogether SMP and voluntary preemption seem to have made my system a little more responsive in interactive use, even if by far the biggest contributions to interactive responsiveness under load are given by better choices for disk accesses and memory management.
060702 The 'ext4' file system and RHEL
In the past few weeks there has been some discussion (1, 2) on extending the ext3 filesystem to cope efficiently with filesystems larger than 8TiB and files larger than 4TiB, which involves extending and redesigning the on-disk file system metadata, in particular extending block numbers from 31 bits to 48 bits.
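The arithmetic behind those limits, assuming the usual 4KiB blocks (just a sketch, using bc for the big numbers):

# Maximum filesystem size = block number limit * block size.
echo '2^31 * 2^12' | bc   # 8796093022208 bytes, i.e. 8TiB
echo '2^48 * 2^12' | bc   # 1152921504606846976 bytes, i.e. 1EiB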
Despite the sensible skepticism of Linus Torvalds there is now a plan for work towards ext4, which to me looks a bit like a job security scheme for the ext3 developers, since otherwise, if it were declared a completed, done job, as Linus thinks ext3 is:
A stable filesystem that is used by thousands and thousands of people and that isn't actually developed outside of just maintaining it IS A REALLY GOOD THING TO HAVE.
they would have somehow to find something else to do. Except that if one understands the context there is also a corporate agenda, designed to benefit a specific and very important company.
The basic issue with creating an ext4 file system is that it is unnecessary: all the improvements in it are for large system support mostly for enterprise level customers, and those needs could be well served with other well established, well tested file systems, like JFS, XFS, or Reiser4 (which has been out for a couple of years already); of these my favourite is JFS, because it scales well both up and down, is very well tested, and has some unique and important advantages over XFS (which however scales better for really huge filesystems).
Who would ever need a 60 bit (48 for block pointer plus 12 for byte within block), extent based ext4 when there are 55 bit extent based JFS, or 44 bit (63 bit on 64 bit CPUs) extent based XFS, and they have been around for many years, and are already widely used by those corporate customers that would be targeted by ext4?
Well, there is only one notable answer, and it is: RedHat and the users of its Enterprise Linux product, because RHEL currently supports only ext3, which is starting to cause problems, as it restricts the whole system to ext3's limits.
RedHat could add support for other file systems to the next version of RHEL but this would have several disadvantages, and these disadvantages are the reasons why currently RHEL only supports ext3:
  • RedHat has a program of testing and qualification for RHEL and applications running on it. For every additional file system type supported the testing and qualification would have to be repeated, which may be a big problem as to time taken and as to the effort required by the ISVs which are already struggling to qualify their applications on multiple distributions.
  • Quite a bit of automated installation and maintenance software would have to be extended or rewritten to support multiple file system types and their subtle differences.
  • Lack of backward compatibility with existing customer filesystems, which would have to be dumped and restored into a new file system structure.
RedHat could support multiple file system types in RHEL, or switch to supporting a single different large-address, extent based filesystem type like JFS or XFS or Reiser4, or add large addresses and extents to ext3. RedHat evidently believes that the last course is best for them: to continue supporting a single file system type, but one that is derived from ext3 and backwards compatible with it:
It has been noted, repeatedly, that users very much like to be able to get new features into their filesystems without having to backup and restore the whole thing.
so that in the next or next but one version of RHEL it can just be put in, and any existing 43 bit ext3 filesystem continues to be usable, while any new files added will be using the new 60 bit addresses and extents.
Now this is nice, but it is largely pointless for most distributions and users other than RedHat and its RHEL partners, and it is therefore totally unsurprising that the ext4 plans involve mostly RedHat developers and those from Lustre, who do a RHEL cluster filesystem based on a modified ext3 (and are probably dying to be bought out by RedHat, like Sistina was) which already includes some of the proposed ext4 changes.