Software and hardware annotations q2 2006

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

June 2006

060625b Swap space misallocation in Linux
I was chatting with someone who was annoyed by the high level of clicking and slowness when swapping under Linux, and I had to break the bad news to him.
Not only does the Linux swap module do something quite objectionable when paging in, it compounds that with a peculiarly unfunny swap space allocation strategy...
OSes typically have two policies about swap space: some, like *BSD and Linux, allocate swap space on demand, just before swapping out a page. Others preallocate, so that swap space needs to be at least as large as main memory. The latter policy is designed to prevent a resource deadlock, where a page needs to be swapped in, which requires paging out another page, but all swap blocks are busy.
The consequence of allocating on demand, at the last possible time, is not just to allow the remote possibility of a resource deadlock (which can anyhow be broken by killing the process to which either the page to be swapped in or the one to be swapped out belongs); it has a far worse implication: swap blocks get allocated to pages more or less at random, causing terrible seek storms when a process has to page in a lot of pages. On my PC swap-ins tend to run at around 1-2MiB/s, on a disk capable of around 60MiB/s in sequential transfers, and indeed the noise the arm makes during swap-outs of idle processes or swap-ins of newly reactivated ones (in which most of a process' pages get swapped in or out) bears out that the issue is just too many seeks.
Incidentally the same problem can happen with the policy of preallocating a swap block for each memory page, as the obvious way to do this is one-to-one: pages in memory tend to be allocated at random, which does not matter for memory, but mirroring that layout onto swap space on disk does.
The obvious alternative is to take advantage of the much, much greater size of disks compared to memory, and statically preallocate a contiguous or mostly contiguous chunk of swap space every time a new virtual memory segment is defined.
This means allocating swap space for every mmap(2) or sbrk(2) allocation. Sure, this requires swap space, if present at all, to be at least as large as the amount of real memory one wants to make use of, but that is not a big deal nowadays.
Looking at this from another point of view, one might as well then have swap space be a simplified extent based filesystem, with but one directory containing as many (sparse) files as there are processes, or with as many directories as there are processes and as many files as there are mappings. In other words the idea is to turn all anonymous mappings into file mappings onto an instance of this filesystem.
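What per-segment preallocation could look like can be sketched as a toy model (not actual kernel code; all names here are made up, and a real allocator would reuse freed extents rather than just bump a counter):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Toy sketch: every new virtual memory segment gets a contiguous
// extent of swap space reserved up front, so paging a segment in
// or out becomes a mostly sequential transfer instead of a seek storm.
class SwapExtents {
    std::size_t next;     // first free swap block (bump allocation only)
    std::map<int, std::pair<std::size_t, std::size_t> > extents;
public:
    SwapExtents() : next(0) {}

    // Reserve a contiguous extent of `blocks` swap blocks for
    // segment `seg` (e.g. on mmap(2) or sbrk(2)).
    std::size_t reserve(int seg, std::size_t blocks) {
        const std::size_t start = next;
        next += blocks;
        extents[seg] = std::make_pair(start, blocks);
        return start;
    }

    // The swap block backing page `page` of segment `seg`:
    // contiguous with its neighbours by construction.
    std::size_t block(int seg, std::size_t page) const {
        std::map<int, std::pair<std::size_t, std::size_t> >::const_iterator
            it = extents.find(seg);
        assert(it != extents.end() && page < it->second.second);
        return it->second.first + page;
    }
};
```

The point of the sketch is only that adjacent pages of a segment map to adjacent swap blocks, which is exactly what on-demand allocation does not guarantee.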
Even a more complex extent based filesystem might do; on my JFS filesystems I can easily see 40-50MiB/s transfer rates on largish files. One could have instead of a swap partition, a swap directory in an existing filesystem.
Anyhow I would like to have a simple minded, low overhead, filesystem designed to support only a smallish number of largish mostly contiguous/extent based files, because it would also be able to replace not just the current delirious swap space manager, but also could replace the Device Mapper which is just an even more primitive version (manual extent allocation!) of such a filesystem.
060625 Partial specialization for APIs in C++
I have been doing some coding experiments for (simple) multiplatform, multialgorithm game oriented math operations. The idea here is that the same API might admit different implementations on different platforms and using different algorithms, and that this might happen within the same program, because in some places one representation is better and in other places another. But then most language designers are not even aware of the idea that inlining should be declared at the call site, not at the function definition.
As previously noted, the way to go in C++ is with templates and partial specialization which is however a perversion of mechanisms designed for other narrower purposes.
I have met an important case where this cannot be done cleanly, and it is when one wants partial specialization in the presence of multiple template arguments. For example, in one case, representing spheres, there is a template argument for the representation of the radius (e.g. whether the square of the radius is memoized or not) and another for the representation of the center (e.g. whether it is an unaligned 3-vector or an aligned 4-vector for SSE).
The difficulty here is that some methods of the sphere API need only be specialized on the representation of the radius, and others only on the representation of the center (and some on both), so one would want to partially specialize on only one type of another, to avoid duplicating the same code needlessly.
But it just happens that it is not permitted to partially specialize member functions (or functions in general). It is possible to partially specialize the class template that defines the sphere API, but then the definition of that API has to be bodily duplicated for every such specialization (a bit like for derivation), and there is no check that a partial specialization of a class template has an API that is the same as, or a subset/superset of, that of the original class template.
This is anyhow just one of the many peculiar rules and special cases of templates in C++. For the brave here is the chapter on templates from a draft of the C++ standard of 1996. I am not afraid of dense language definitions (I can read conversationally the Algol 68 Revised Report), but this one appalls me because of the semantics it describes (and the syntax is worse), not how they are described.
There are indeed deep and obscene books that spend hundreds of pages detailing how to abuse C++ features to achieve simple goals, working around crazy quirks like this. I am not providing links to these grimoires of horror to avoid encouraging such practices, and there are already quite tragic suggestions online.
One of these is to work around the impossibility to partially specialize functions and member functions by embedding each in a class (a bit like a functor) which can then be partially specialized. Awesome...
060612 Sustainable ADSL traffic
Having slightly updated my sabishape script for traffic control configuration, I have changed the default uplink rate limit to 200kb/s from 220kb/s. The reason is that I have done some experiments, and on a 288kb/s ADSL uplink 200kb/s is what can be carried without causing congestion. The effective bandwidth available for UDP or TCP payload is around 23KiB/s, or around 185kb/s.
So the insane number of levels of encapsulation inside ADSL (L2TP, PPPoA, ATM, ...) costs around 40% overhead or 30% wastage, and the much leaner IP/TCP costs an additional 9% overhead or 8% wastage, so that a link theoretically capable of 36KiB/s can realistically carry only 23KiB/s.
Similar numbers apply to the 576kb/s downlink, which also must be limited (but while the reason for limiting the uplink is to shape it to prevent ACK packet delays, the reason for limiting the downlink is to prevent queue buildup on the incoming side of the DSLAM or whatever). The highest sustainable downlink bandwidth is around 470kb/s on my line so I limit it to around 440kb/s (ingress traffic shaping is a lot less precise than egress shaping), or roughly 52KiB/s for payloads.
This is significantly higher than 2 times the effective value for the uplink, which has 2 times the nominal bandwidth (576kb/s instead of 288kb/s), and I suspect that this is because the downlink side is better optimized/buffered than the uplink side.
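The overhead and wastage percentages above are two views of the same loss, which is easy to mix up; a small sketch of how the two relate, plugging in the uplink figures from above:

```cpp
// Overhead: extra capacity spent per unit of payload carried.
// Wastage: fraction of nominal capacity that is not payload.
// The two describe the same loss but give different percentages,
// which is why "40% overhead" and "30% wastage" can both be right.
double overhead(double nominal, double payload)
{ return nominal / payload - 1.0; }

double wastage(double nominal, double payload)
{ return 1.0 - payload / nominal; }

// For the 288kb/s uplink carrying 200kb/s after the ADSL layers,
// overhead(288,200) and wastage(288,200) roughly match the 40%/30%
// quoted above; overhead(200,185) and wastage(200,185) roughly
// match the 9%/8% quoted for IP/TCP.
```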
060606b Constructor, destructors, C++ and terminology
While discussing the idea that non-virtual destructors can be used to indicate that a class should not be inherited from, a friend suggested that this can be achieved by making constructors private, with initialization then done by a separate, explicitly called method, also because constructors are fragile (exceptions thrown inside them are not well defined).
The problems that I think occur with this are that it is yet another perversion of C++ features to do something that C++ was just not designed to do, and that one really needs at least the no-parameter constructor for arrays of that type.
Well, I reckon that C++ constructors should be used strictly to ensure that the value of the object is well defined, not that it is fully set up. That is, for example, that all pointers in the representation are initialized to 0. In particular no memory allocation should be performed inside a constructor; if the object has to be initialized with pointers to some other objects, the other objects should be allocated by the caller of the constructor, and passed to it as arguments. Or else the constructor should just initialize all fields to some neutral value and then proper setup should be done in a set-up method.
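A minimal sketch of this discipline (Buffer and its method names are hypothetical, not from any real code base): the constructor only makes the value well defined, and all allocation happens in an explicit set-up method whose failure can be reported normally instead of by throwing:

```cpp
#include <cstddef>
#include <new>

struct Buffer {
    char        *data;
    std::size_t  size;

    Buffer() : data(0), size(0) {}          // well defined, not set up

    bool setUp(const std::size_t n) {       // explicit, fallible set-up
        data = new (std::nothrow) char[n];
        if (data == 0) return false;
        size = n;
        return true;
    }

    void tearDown() { delete [] data; data = 0; size = 0; }

    ~Buffer() { tearDown(); }
};
```

Note that tearDown() is safe to call on a merely constructed object precisely because the constructor made the value well defined.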
Then rereading the paragraphs above I realize that there are two (or three) important terminology issues, about the words object, constructor and destructor.
First, I used object above in its proper C++ sense, which is memory area, just like in C. That in C++ terminology object means something very different from class instance, that there is indeed no single word to signify class instance, and that C++ is probably by far the single most popular object-oriented language, has caused almost infinite confusion, and is very regrettable.
Unfortunately such definition of object was also unavoidable, as it is defined that way in C, and C++ was designed to have as much as possible of C compatibility, even including the terminology. That in C++ an object is a memory area is why this code is perfectly valid and has a well defined meaning:
class Complex
{
public:
    float r,i;

    Complex(const float r = 0.0f,const float i = 0.0f)
      : r(r), i(i) {}
};
[ ... ]
    Complex(1,0) = Complex(0,1);
[ ... ]

Now, if an object in C++ is not a class instance, what is the class instance? Well, it is the value, represented as a bit pattern (or more precisely byte pattern) contained in the object. This means that two distinct objects can contain the same class instance, exactly in the same way that in C the objects called i and j both contain the same int type instance in this line:
static int i = 3, j = 3;
So in this piece of code:
struct Coord
{
    unsigned x,y;

    Coord(const unsigned x,const unsigned y) : x(x), y(y) {}
};

Coord c1(512,384);
Coord c2(512,384);
the two objects may be (represented as) the 8 bytes at addresses &c1 and &c2, and both objects contain the same instance of class Coord, which may be (represented as) the 64 bit sequence:
00000000 00000000 00000010 00000000 00000000 00000000 00000001 10000000
As a side note, there are very few values in C or C++ that are not necessarily stored inside an object: manifest constants, like 3, named manifest constants, like enumeration type members, and arrays and functions (that arrays are named manifest constants in C and C++ is almost universally unknown, and the nature of functions is almost as obscure).
Given that in C++ all classes are aggregate types, the role of a constructor is to ensure that the class instance stored into an object is well defined. Nothing more nothing less.
The very term constructor is a classic computer science term: properly speaking, all values, including class instances, are atomic in the sense that they are not composed of parts. So the instances of this type:
struct Point { float x,y; };
are not composed of two fields, they are just values, typically (represented as) a sequence of 8 bytes on most current computers (equivalently, an 8-digit base-256 number). A function that maps from multiple values into a single aggregate value is called a constructor, and the C++ usage of the term is fairly consistent with the computer science usage, except that it conflates assignment into it. C also has constructors, for example in this line:
static Point p = { 1.0f, 1.0f };
the constructor in the computer science sense is (more or less) { 1.0f, 1.0f } and in the C++ sense it is (more or less) = { 1.0f, 1.0f }.
That in C++ constructors are mostly as in proper computer science terminology also explains why they are class-related, not class-instance-related, functions: because their role is precisely that of mapping to a class instance.
Note: there is another important use of constructor in computer science, and it is when qualified as type constructor: a type constructor is a function that maps from a set of types into a new type. For example in C struct and * (as in int *) are type constructors. In C and C++ and most other languages type constructors do something more than just constructing a type, they also define the constructor and the destructors.
Unfortunately the term destructor has a completely different meaning in C++ than in computer science, and in C++ the use of that term is misleading because it suggests that it sort of does the opposite of a constructor.
The functions that are called destructor in computer science discourse are indeed those that do the opposite of what a constructor does, that is they map a value of an aggregate into a value of one the types used to construct it, so for example in this type definition
struct Point { float x,y; } p = { 1.0f, 2.0f };
there are two destructor functions, called x and y, which when applied to an object containing an instance of the Point type, as in p.x (C and C++ use a special postfix syntax for destructor functions), return one of the values used in constructing it (I would have preferred the term deconstructor for these, though).
Actually, technically the C/C++ syntax for data member names disambiguates to a destructor if used in an rvalue context, and to a constructor in an lvalue context like p.x = 0.0f (which is really a shorthand for p = Point(0.0f,p.y), that is the assignment of a different class instance to the object called p).
The use of destructor in C++ is very misleading because it actually names what should probably be called tear-down (or perhaps terminator): its role is not to undo the work of the constructor, because there is no way to un-construct a value, but to undo the set-up performed by any set-up method called after the class instance has been constructed. Such set-up usually involves, if any, establishing relationships to other values (e.g. by allocating resources).
This role of the C++ destructors is indeed class-instance specific, because the relationships to be torn-down are in general class-instance specific.
060605 The new Microsoft ClearType oriented fonts
Microsoft Typography have prepared some fairly good looking new fonts for MS Windows Vista, and I have found an article with type samples and some interesting comments.
One of the comments says that several of the ClearType snapshots have heavy color fringing (ignore another comment that different screens need different tuning; virtually all have pixels in RGB order). Indeed, it is ugly, and reinforces my opinion that anti-aliasing for fonts is not such a good idea, and that ClearType-style antialiasing results in too much color fringing in too many cases.
Also, as another comment points out, these fonts rely on ClearType or at least some grey level antialiasing, as they are not that well hand-hinted. It surprises me that a corporation with a lot of money like Microsoft in an era with cheap offshore labour cannot spare the change to have their fonts hand-hinted. Perhaps this is meant to increase the leverage of the developers of ClearType and/or help get a bonus for keeping costs down, even if by very little.
The amusing side of this is that these not-too-well hinted fonts result in glyphs that are about the same quality as those from the autohinters in FreeType2, especially the Type 1 autohinter which seems to me to do a better job than the TrueType one.
Of these fonts, Consolas is available as a free download, and the others apparently are bundled with the MS Office 2007 beta packages.
060604 No to keyboards with black keys
While trying to move around the stuff in my home I took a break to read the May 2006 issue of PCFormat and I was disappointed to see a group test of 10 fancy keyboards (PCFormat's main mission is to market expensive stuff to game maniacs), of which 9 had black keys, and only one white keys.
Black keys with white lettering are a rather bad idea, because as the keys smudge the lettering becomes ever less visible. With white keys and black lettering the white surface becomes grayer with time, but this does not really impair the readability of the keyboard much.
060603 Linux based PS3 as PC replacement
Some time ago I was speculating that Sony's ambition was to displace low end PC sales with their PS3:
This strategy also makes more sense for Sony, because the price of Xbox 360 and PS3 is comparable to that of many low end PCs, and there is no question that Sony is trying to harm Microsoft's sales of OS licences as much as possible: low end PCs are ideal for Microsoft, because the OS licenses they sell are priced per-unit, not as a percentage of the sale price, so they make a lot more money when two US$400 PCs are sold than one US$800 PC is sold.
Which suggests that Sony will not only deliver GNU/Linux on the PS3, but will quite deliberately include something like KOffice, as every MS Windows or MS Office license sale that a PS3 displaces helps Sony cut off the air supply of Microsoft.
and it looks like Sony have confirmed this strategy authoritatively:
Besides defending the PS3, Harrison took time to evangelise the device, which will launch worldwide this November. In particular he said the Linux-based operating system on the console's hard drive will have enough processing power and non-gaming functionalities to render traditional PCs -- most of which use a form of Microsoft's Windows OS -- moot in the home.
"We believe that the PS3 will be the place where our users play games, watch films, browse the Web, and use other (home) computer functions," said Harrison. "The PlayStation 3 is a computer. We do not need the PC."
060602 C++ and separating interface from implementation
I have been writing, somewhat quickly, some sample code with and without SSE for some geometry work. Well, I have been as usual rather annoyed that it is not quite easy in C++ to separate cleanly interface from implementation (never mind pragmatics).
The reason it is not quite easy is that C++ was simply not designed with the idea of a clear separation of interface and implementation. In part this may be because of its C roots, and the attending 50s-style linker technology it presumes (something that has been lamented by Stroustrup himself).
In part however because C++ is largely a compilation of old ideas, principally of Simula 67-style prefixing (reduced to class derivation also called inheritance).
For example it is not a good idea to separate interface and implementation using the C++ way of doing class inheritance, because:
  • Quite ridiculously, inherited member functions must be re-declared (not merely re-defined) in derived classes (this also opens up issues with inadvertently shadowing instead of inheriting).
  • Since C++ uses contravariance (or more precisely its subset, invariance) for parameters it is not possible to define a more specialized version of a method in a derived class, which makes it almost impossible to safely define methods specialized on the representation of the class type.
Well, the way to go is to use templates, with a declaration-only template, that is a template that contains only declarations of class methods but not definitions, and inherits from (or contains) the template argument type; the definitions of the actual representation types, and of the related methods, are then provided using template method specialization, as in this example, of which these are some excerpts:
template<typename R>
struct Point3I: public R
{
    Point3I(const float x,const float y,const float z);

    Point3I(const Point3I<R> &that);
    Point3I<R>          &operator =(const Point3I<R> &that);

    float               xGet() const;
    float               yGet() const;
    float               zGet() const;

    float               xPut(const float x);
    float               yPut(const float y);
    float               zPut(const float z);

    Point3I<R>          &sub(const Point3I<R> &that);

    float               dot(const Point3I<R> &that) const;
    float               norm2() const;
};

struct Point3S
{
    union
    {
        __m128          q;
        float           v[4];
        struct          { float w,x,y,z; };
    };

    Point3S(register const __m128 q) : q(q) {}
};

template<>
inline Point3I<Point3S> &Point3I<Point3S>::sub
    (register const Point3I<Point3S> &that)
{
    this->q = _mm_sub_ps(this->q,that.q);

    return *this;
}
In that example Point3I is the template that describes the API, Point3C defines one representation of the underlying type and the specialized Point3I<Point3C> methods define the implementation of the API methods for that representation, and similarly for the alternative implementation type, Point3S (for the SSE version of the representation).
The implementation type may be, almost equivalently, the base class for the template class, or used to define a data member within it. In the first case access to the representation members is less verbose, but one must define some constructors for the implementation type, in the second case one must access the representation via an explicit name, but one can define just the data members in the representation class.
As usual all this is quite tortuous and is a bit of an unobvious use of the facilities of C++, but to me it looks almost acceptable; except that it is based upon templates, and C++ rules for templates are many and obscure.
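The same shape can be seen in miniature with a plain (non-SSE) representation, so that it is self-contained; the names here are made up for illustration, not taken from the code above:

```cpp
// Declaration-only template: it declares the API and inherits the
// data members from the representation type R, but defines nothing.
template<typename R>
struct Vec2I : public R {
    Vec2I(float x, float y);
    Vec2I<R> &add(const Vec2I<R> &that);
    float xGet() const;
    float yGet() const;
};

struct Vec2Plain { float x, y; };      // one possible representation

// The implementation of the API for that representation, supplied
// via explicit specialization of the member functions.
template<> inline Vec2I<Vec2Plain>::Vec2I(float x, float y)
{ this->x = x; this->y = y; }

template<> inline Vec2I<Vec2Plain> &
Vec2I<Vec2Plain>::add(const Vec2I<Vec2Plain> &that)
{ this->x += that.x; this->y += that.y; return *this; }

template<> inline float Vec2I<Vec2Plain>::xGet() const { return this->x; }
template<> inline float Vec2I<Vec2Plain>::yGet() const { return this->y; }
```

An alternative representation (say, one padded for alignment) would get its own set of specializations under the same declared API.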

May 2006

060528 Coarse parallelism and pipeline duplication
Discussing coarse parallelism with a friend, another good example came up besides cars and Gran Turismo: sports-style games with lots of ragdolls, for example football or baseball games. In these a major bottleneck is animating and skinning each character; the characters are the focus of a player's attention, and are thus important to render well.
Parallelizing the rendering process of each character is very hard, because that process is a mostly serial pipeline, and the independent, and thus parallelizable parts of each step in that pipeline are small and thus would require expensive fine grained parallelism.
But all the characters are essentially independent from each other, and the rendering pipeline can be quite independently instantiated for each and assigned to a different processor.
This actually applies to many if not most simulation games, which currently rely on a certain scale of environment: in large environments most moving entities are widely separated, and can be processed quite independently.
Surely pileups happen in first-person shooters or team-based sport games or racing games, but these are the exceptions rather than the rule, and often even entities in contact can have many independent properties.
All this is not exactly new: military battlefield simulations have often been based on the idea of dedicating even a full computer to each major entity in the simulation, like a tank. The logic again here is that entities in a battlefield simulation tend to be quite distant from each other, as nowadays most weapons are ranged and powerful, and dispersal is the norm.
060524 Non-virtual destructors in C++
I was chatting with someone today about virtual destructors in C++, and when they are not appropriate. My argument was that destructors should routinely/always be virtual, because there is no Java-like final in C++, so any class can always be derived from.
His argument, which startled me, was that non-virtual destructors can be used to indicate that a class should not be used for derivation. I can see the point, but to me it is a bit dangerous: what if the class is used that way?
Well, I guess that the point here is that one relies on warnings from the compiler, and treats them as errors. Well, I suppose that all compilers will indeed warn if there are virtual methods in a class with a non-virtual destructor, as if the latter meant final, but relying on this is yet another wistful attempt to turn C++ into something else.
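What goes wrong without the virtual destructor can be shown in a small sketch (the tracking flags are there purely for demonstration): deleting a derived instance through a base pointer only runs the derived destructor because the base destructor is virtual; with a non-virtual base destructor the same delete is undefined behaviour, and the derived destructor is typically skipped.

```cpp
static bool base_torn_down    = false;
static bool derived_torn_down = false;

struct Base {
    virtual ~Base() { base_torn_down = true; }   // virtual: safe to derive
};

struct Derived : public Base {
    ~Derived() { derived_torn_down = true; }
};

// Deleting through a Base pointer: both destructors run only
// because ~Base() is declared virtual.
void destroy(Base *p) { delete p; }
```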
060514 Quick write speed test for NFS and CIFS
Well, I could not resist doing some more quick write tests to add to yesterday's network read tests:
Network file system write test
(223.1MiB, 20k inodes)

Test        4KiB
CIFS tree   2m01s
NFS tree    2m09s
CIFS file   50s
NFS file    1m01s
For comparison, writing the same tree on the local disk of the client system (a laptop with a slow 4200RPM 2.5" drive with a top sustained 21MiB/s transfer rate) takes 1m36s (2.3MiB/s) and on the server takes 28s (8.0MiB/s).
The results above show that for writing CIFS and NFS are about equivalent, except that NFS is much faster if asynchronous writing is permitted, which is not a surprise (Samba is configured not to flush on every operation). It is a bit more of a surprise that block size has essentially no effect on time.
The notes for these tests are the same as those for the read tests with these additional ones:
  • The files to be written were in a LZO compressed .tar, which shrank them to around 78MiB, and the source archive was in a memory resident tmpfs filesystem.
  • The Samba server was configured with sync always and strict sync disabled, so writing was not synchronous on the server.
  • It is not clear to me whether the kernel cifs driver does support the async option for mounts. When I tried that the times were the same as for the sync option, while for NFS they were very different.
  • Ideally I should have re-made the target filesystem on every write, but since it was very big (84GiB) and very empty I just kept writing new trees/files to it (without deleting previous ones), so as to take advantage of the contiguity of the free space.
060513 Quick read speed test for NFS and CIFS
I have been for years reminding people that Samba shares mounted via the CIFS driver are faster and more POSIX compliant than NFS shares mounted via the NFS filesystem driver, for Linux-to-Linux file sharing. This sounds a bit counter intuitive, but:
  • NFS is in practice used only for POSIX-POSIX file sharing, and that has a small user and developer base. SMB/CIFS instead has a huge user base, and also because of competition with Microsoft, a lot more effort has been spent in tuning Samba, and the parts of the Linux kernel Samba uses, than in tuning NFS on most platforms.
  • The NFS protocol is admirably simple (and stateless) but it makes implementing semantics close to POSIX hard or inefficient; the more complex (and stateful) SMB protocol with Samba's UNIX extensions actually maps better to POSIX semantics.
The Samba HOWTO itself then states:
Generally, you should find that Samba performs similarly to ftp at raw transfer speed. It should perform quite a bit faster than NFS, although this depends on your system.
However, as it then says, things vary with time and context and usage. So I decided to do yet another filesystem test for SMB/CIFS and NFS. The test is smaller scale than my usual ones, in part because part of the test is about the speed of network access when the files accessed are wholly cached in RAM on the server.
Because many people use remote file access to their home directories or to shared source trees, I have used as test data the sources of Linux 2.6.16, which amount to 223.1MiB in around 1,200 directories and 19,200 files, the test being a full tar repacking.
For comparison I have also tested reading speed for a single file containing the same data (its tar archive, uncompressed). The tests have been quick and minimal, being just for reading, not writing or searching or deleting.
Network file system read test
(223.1MiB, 20k inodes)

Test        4KiB
CIFS tree   1m18s
NFS tree    2m30s
CIFS file   46s
NFS file    1m04s
Well, the summary is that usually CIFS beats NFS significantly for access to uncached files, and especially for a tree, but if the file(s) are fully in the cache on the server then NFS can be significantly faster. Probably Samba is better at doing back-to-back IO and the Linux kernel NFS server is better at being low overhead. Overall I would use CIFS, with a 4KiB block size or 8KiB if sequential accesses are going to dominate.
Notes on how the figures above have been obtained:
  • The times above are elapsed time on otherwise quiescent systems. Program system time was usually insignificant at between 3-10s. Overall system time observed during peak IO (8-9MiB/s) was around 25-30% on both client and server.
  • Linux kernel 2.6.16, with in-kernel NFS server, and with Samba 3.0.22 user space server for CIFS.
  • The filesystem was mounted read-write on both the server and the client, and with atime on the server and noatime on the client. I tried setting atime on the client too and times were longer, but not significantly.
  • NFS was mounted always with async, to be more comparable to stateful CIFS. I also tried with sync and even in these read tests there was some difference.
  • Block sizes were set to the same value for both read and write. I also tried 12KiB and 16KiB blocks; while they are a bit faster (for this sequential access test) the difference is not significant.
  • I have no certain idea why NFS access to the tree is significantly slower with 8KiB blocks. I suppose that this is because many files are smaller than 8KiB (out of 20k, around 6k are under 4KiB, 8k under 6KiB and 10k under 8KiB).
  • I checked that during cached tests the server would not read from disk; it would still write to disk the updated inodes. But of course the data should not be cached on the client, so the network filesystem on the client was unmounted and remounted before each test.
  • Before each uncached test I unmounted and remounted both the disk filesystem on the server and the network filesystem on the client to ensure that no blocks would remain cached.
  • The network connection is 100Mb/s full duplex Fast Ethernet via a switch, with a Realtek 8139C on the server and an Intel EtherExpress PRO/100 on the client.
  • The server has an Athlon 64 3000+, 1GiB of 400MHz DDR-RAM, and the tests were on a 160GiB disk capable of 60MiB/s sustained sequential and with an 8ms average access time.
  • The client has a Pentium III 800MHz, 256MiB of 133MHz RAM.
060512c What is OO and why it works
OO programming is one of the more misunderstood techniques, and most of the definitions one reads in self-important texts and textbooks are gravely inappropriate. The historically appropriate one, the one that works is:
It is a program decomposition paradigm
That is, it is not a way to organize a program text or a program execution, but a way to subdivide a program into parts.
Programs are decomposed into modules; each module contains exactly one type and all and only the functions that access its internal representation.
This is based on the assumption that inter-function dependencies most often arise from sharing access to the representation of a type (another way of saying this, used in one of the few non-ridiculous books on OO, is that each module contains exactly one ADT implementation). By putting together all the functions that depend on that representation, inter-module dependencies are minimized, and by including only the functions that access that representation, each module's cohesiveness is maximized. Most OO languages sort of check that all relevant functions are part of the module, but none or very few check that only the relevant functions are part of it.
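A minimal sketch of such a module (the names are made up): the type and all and only the functions that touch its representation are grouped together, so every dependency on the representation is confined to one module and clients depend only on the interface.

```cpp
// One module: one type plus all and only the functions that
// access its internal representation.
class Counter {
    unsigned count;                 // the representation
public:
    Counter() : count(0) {}
    void increment() { ++count; }
    unsigned value() const { return count; }
};

// A client function: it depends only on the module's interface,
// never on the representation, so the representation can change
// without touching any client.
unsigned bumpTwice(Counter &c)
{ c.increment(); c.increment(); return c.value(); }
```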
Usually but not necessarily, the program text organization reflects the program organization
Often OO languages enforce the restriction that the text of the declaration or even the definition of the functions that belong to a module must be contained in the declaration or the definition of the module alongside with that of the type. This restriction however is not part of the core definition.
The rationale for this approach is based on the truly fundamental and thus amply forgotten research on capability systems and on modular programming by Parnas, and can be summarized in this way:
How do we solve the software crisis?
By systematic program development.
What is systematic program development?
A discipline of programming based on adherence to some principles.
What kind of principles?
Simple, structured program design to conquer complexity. Maximizing various aspects of reuse to conquer scale.
How is reuse maximized?
By decomposing programs into explicit modules.
What kind of decomposition?
One that minimizes the interconnections between modules.
What kind of modules minimize interconnections?
Clusters of procedures that operate on the same representation.
Why is this supposed to be effective?
The details of representation often imply lots of interconnections.
When is it difficult to apply this simple principle?
When a procedure is strongly related to more than one representation.
When does this simple rule break down?
When some procedures are more strongly interconnected by something else than sharing access to a given representation.
I would also like to emphasize that OO is not anything about these popular and laughable misconceptions:
A data model, in particular the network one
OO programming is about structuring programs, not data. It does not say anything about how to organize data, and might be applied to programs that use data according to any data model, including the network or relational ones. Most OO languages have data structuring primitives (structures, arrays, pointers) whose data model somewhat resembles the network data model, but that is accidental.
A model of computation, in particular the actor one
OO programming is not about how programs run, their semantics, but how programs are structured. Program semantics are defined by the model of computation they are designed for. The actor model, based on asynchronous, self contained actors copying data to each other (messages) is somewhat intrinsically OO, but OO is not equivalent to the actor model. It is unfortunate that an important family of OO languages, the Smalltalk family, was initially based on the actor model, and then switched to the applicative model, without changing terminology.
Related to inheritance
OO program structuring requires just that modules contain one data type and all and only the functions accessing its representation. Relationships and reuse among modules may be achieved with inheritance (a particularly lame choice), or in several other ways, for example delegation.
Related to unique ids or polymorphism
There are many, many languages that have polymorphism (whatever that is) and unique ids, and that are nowhere like OO. There are also OO languages that do not have unique value ids or polymorphism. Unique ids tend to be a bad idea, and so-called polymorphism, a term that covers confusingly different techniques, is often a good idea, but one that is valuable to any style of programming.
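To illustrate the point about inheritance not being essential, here is a sketch in C (all names hypothetical) of reuse by delegation: the "derived" type simply forwards to an embedded member instead of extending a base:

```c
/* Reuse by delegation rather than inheritance: BoundedCounter forwards
   to an embedded Counter instead of extending it.  Names hypothetical. */
#include <assert.h>

typedef struct Counter { int n; } Counter;

void counter_inc(Counter *c)       { c->n++; }
int  counter_get(const Counter *c) { return c->n; }

typedef struct BoundedCounter { Counter inner; int limit; } BoundedCounter;

void bounded_inc(BoundedCounter *b)          /* delegates to Counter */
{
    if (counter_get(&b->inner) < b->limit)
        counter_inc(&b->inner);
}
```

The relationship between the two modules is explicit in the forwarding calls, with no language support for inheritance required.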
060512b Inlining at the call site
In my previous entry I have mentioned that The value of many optimisations depends on the static or dynamic context in which the relevant code is used, and there is one particular example of this that has long been mishandled by most languages, and that is inlining of functions.
Inlining has both benefits and costs: among the costs a potential increase in the size of the code; among the benefits that compilers can merge the code of the function into the code of its caller and optimize the two together, which is especially valuable if the function is a leaf function or some of the arguments are manifest constants.
A rarely made observation is that both the benefits and the costs depend mostly if not exclusively not on the function but on where it is called. For example inlining a function at most call sites will bring almost negligible benefits and just increase overall code size, while inlining the same function in a hot spot, for example inside an often executed loop, can bring very large benefits with a small increase, or even a decrease, in code size.
Unfortunately most languages quite dumbly allow designating only function definitions as inline, not function calls. Admittedly, function definitions whose calls may be designated as inline should be marked specially, so that the compiler keeps the definition around for when an inline call is found.
Therefore function definitions should be marked as inlinable to indicate that they might be inlined, but only where the call is marked as inline. For example:
inlinable void Vec3Add(float *const a, const float *const b)
{
   a[0] += b[0], a[1] += b[1], a[2] += b[2];
}

/* ... and then at a hot call site: */
   for (int i = 0; i < stripLength; i++)
      inline Vec3Add(&strip[i][0], displacement);

There are similar inanities as to the expansion of generics: for example manifest generics (such as templates) are usually always expanded inline, but most often they could just as well be handled as closed generics (for example as if inheritance were involved).
Another example of an improperly wide optimization designation is in Java, where the final keyword can be used on a class or method definition to indicate that the entire class or method deals with manifest typing, always; it would be more appropriate to also allow such use before a specific use of the class or the method (with suitable checks to ensure this is not misused).
Conversely many languages with closed but not manifest type systems (for example Objective-C and Common Lisp) can detect contexts in which value types are in fact manifest, and inline only where possible.
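Lacking such a call-site designation, a rough approximation is possible even in plain C: define an inline twin next to the out-of-line function and choose which one to call at each site. A sketch, with hypothetical names:

```c
/* Approximating call-site inlining in C: an inline twin for hot call
   sites, an out-of-line version for all others.  Names hypothetical. */
#include <assert.h>

static inline void vec3_add_fast(float *a, const float *b)  /* hot sites */
{
    a[0] += b[0]; a[1] += b[1]; a[2] += b[2];
}

void vec3_add(float *a, const float *b)                     /* cold sites */
{
    vec3_add_fast(a, b);
}
```

Hot loops call vec3_add_fast and pay the code size cost there only, while ordinary call sites call vec3_add and share one copy of the code; the drawback is that the choice is encoded in the function name rather than in a proper call-site annotation.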
060512 Open, close or manifest type systems, programs and libraries
I have been occasionally discussing the merits of OO programming and of things like templates and dynamic dispatching, and I have realized that as usual these are widely misunderstood, in part because some vital context is often not communicated.
Part of this context is that it matters greatly whether one is talking of programs or of libraries, and whether they have an open, closed or manifest type system. As to this:
  • a program is some body of code that is complete as it is, and is meant to be executed by its own thread of control;
  • a library is a body of code that is meant to be used by another body of code (another library or a program), and by the thread of control executing a program;
  • an open type system is one where some types are known only at link time or at run time;
  • a closed type system is one where all types are known by inspection of the program text, at compile time, but the particular type of the value denoted by a variable may not be known at compile time (sometimes these are called typeless type systems).
  • a manifest type system is one where both all possible types and the type of the value of each variable are known at compile time.
Usually programs have manifest or closed type systems, most libraries have manifest type systems too, with many being closed, and some, in particular libraries implementing containers, having open type systems. It also matters practically whether a closed or manifest type system has few or many types, because a closed type system with very many types may be more expediently handled as an open one. Some examples:
  • ls is a program with a manifest and very small type system.
  • The Linux kernel is a library with a closed and fairly large type system.
  • The STL for C++ is a library with an open type system.
  • Thanks to blessing Perl programs can have open type systems, but a Perl implementation has a closed type system.
Whether something is a program or a library matters greatly as to optimization, and whether a type system is open or closed, and/or how many types matters greatly to program organisation:
  • If a program has a closed, or more often a manifest type system, especially one with few types, then the use of a language with a sophisticated type system that can handle dynamic or static genericity is often unnecessary.
  • Since a program is known as a whole, its optimisation can depend on knowledge of the whole, while that of a library cannot depend on knowledge of the context in which it is used.
    The value of many optimisations depends on the static or dynamic context in which the relevant code is used, and in particular on knowledge of the types of operand values, and more particularly on manifest knowledge of such types.
For example programs like games tend to have small and manifest type systems, which can be handled well and optimized profitably in C or Java rather than in C++ or Python. Conversely if a program or a library has a closed or open type system with many types, such as for example a media library, it can be cumbersome to use C or Java, and it is easier and probably more optimisation friendly to use C++ or Python.
Part of the reason is that a small and manifest or even closed type system can be handled without using overloading and genericity, or using them sparingly, and this opens up the opportunity to do special case, contextual optimisation; conversely the complexity of handling genericity and overloading, especially of the dynamic sort, means that simulating them in a language that does not directly support them can lead to obscurities and inefficiencies.
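The optimisation point can be made concrete with a sketch in C (hypothetical names): with a manifest type the call is direct and the compiler can inline it, while an open type system forces every call through a pointer bound at run time, which the compiler cannot see through:

```c
/* Manifest type: a direct call the compiler can inline and specialize.
   Open type: an indirect call through a pointer bound only at run time.
   All names are hypothetical. */
#include <assert.h>

int area_square(int side) { return side * side; }   /* known statically */

typedef int (*area_fn)(int);                        /* open: bound later */

int area_dispatch(area_fn f, int arg) { return f(arg); }
```

A call to area_square can be folded to a constant when the argument is manifest; a call through area_dispatch can only be optimized if the compiler can prove which function the pointer holds, which in a genuinely open type system it cannot.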
060511 PS3 Gran Turismo HD halfway demo
It appears that at E3 recently Sony had a half-way demo of Gran Turismo HD for PS3. The demo was half-way in that some of the art assets were clearly the same as those in the PS2 version, and the article comments:
As great as Gran Turismo HD looked, and as much as we were pleased to see and play it, it seemed like a strange offering for Sony to open its pre-E3 event with.
But I don't think it is strange at all: Gran Turismo is after all the signature game for Playstation just as Halo is for Xbox and Mario for Nintendo.
I reckon that in effect Playstation consoles are or should be designed as Gran Turismo platforms, and that if there is a logic to Sony's hardware choices, it is (or should be) Is this going to work well for Gran Turismo?. For example the Cell's 6 or 7 SPEs seem meant to provide coarse parallelism for Gran Turismo, each handling one or two cars, probably with one of them or the main CPU dealing with the scenery. This is possible because almost all the time in a racing game cars do not interact with each other, but almost only with the scenery, and then with very little of it (terrain and lights).
As to the low quality art assets the article reports that:
From a distance all of the cars and environmental features looked really impressive, but up close it was difficult not to be distracted by the occasional jaggy and the textured liveries of the race cars that clearly weren't designed with anything like resolutions of 1920 x 1080p in mind.
The suicidal spectators on the Grand Canyon rally track also gave away the demo's roots, since their low poly models and PlayStation 2 textures appeared to have received very little PS3 love ahead of today's event.
Polyphony Digital is, of course, working on a Gran Turismo game for the PS3, but we'll be very surprised if the final product ends up looking anything like today's demo.
Here my impression, or perhaps hope, is that the launch version of Gran Turismo, the one being developed, will use the coarse partitioning afforded by the SPEs to have dynamically generated art assets, and in particular textures.
060510 Quick speed test for Reiser4
Feeling the temptation to do a quick speed test of the Reiser4 file system, I yielded to it, and I have repeated my usual speed test pattern on my usual smaller root filesystem. The test setup is not directly comparable to previous ones, because my root filesystem has been churned a bit in the meantime (I have deleted a lot of small files and added some big ones, so it now has 44k directories and 225k files), so I have redone the JFS speed tests in the same conditions to provide a reference point.
Desktop filesystem speed tests (7.1GiB, 269k inodes)

File system    Free space   Restore   fsck    Repack   Find    Delete   Notes
Reiser4 4KiB   2286MiB      7m39s     2m20s   5m14s    1m50s   1m59s    umount: 6m20s
JFS 4KiB       2249MiB      10m24s    56s     4m54s    1m11s   5m45s    umount: 1s
The way I have done it is similar to that for previous tests, with the major change that the restore was done not from a pre-made tar archive, but lazily by a dynamically generated tar archive fed via a pipe.
My main impression from such a simple test is that Reiser4 feels faster than ReiserFS, and broadly equivalent to JFS, especially when one takes into account that Reiser4 does not seem to flush modified metadata to disk promptly; the latter accounts for the very long umount times after most operations.
For example this seems to happen after find, as it updates the access time of each inode, and it seems that Reiser4 delays writing the modified inodes for a long time, or until umount, causing a flurry of widely scattered writes during the latter. This probably also accounts for the big difference in the elapsed time of the test where all files get deleted.
As additional data points, a straightforward disk-disk image copy of the partition holding the filesystem, which is 9.8GiB, takes 3m16s elapsed and 42s system time, resulting in a rate of 51.2MiB/s; just reading it sequentially take 2m55s elapsed and 24s system time, resulting in a rate of 57.4MiB/s.
Which means that tree copying runs at around 25-30% of the maximum achievable and tree reading at around 40-45% of the maximum achievable, which is not too bad considering the size of the tree and that it has many scattered small inodes.
060509 Summarizing the font situation
I have been trying to summarize, for someone I was trying to help, the font situation on GNU/Linux, this time without any explanation (and with some simplification), and here it is:
  • Make sure that all your font directories (including those for bitmap fonts) are listed in the same order (bitmap fonts first, then Type 1, then TrueType) in both the /etc/fonts/local.conf file for Fontconfig and in the /etc/X11/xorg.conf file for the X11 server (or the equivalent file for the X11 font server). For example I have something like this:
        FontPath 	"/usr/share/X11/fonts/misc"
        FontPath 	"/usr/share/X11/fonts/100dpi:unscaled"
        FontPath 	"/usr/share/X11/fonts/75dpi:unscaled"
        FontPath 	"/usr/local/share/fonts"
        FontPath	"/usr/share/ghostscript/fonts"
        FontPath 	"/usr/share/X11/fonts/Type1"
        FontPath 	"/usr/share/X11/fonts/TrueType"
  • Make sure that you have as few fonts available as possible, because having many is very expensive indeed.
  • Make sure that you start all KDE apps with kwrapper, or kfmclient for Konqueror.
  • If your screen DPI is quite close to 100 or 75 (for example 96 or 72), use the -dpi argument to the X server to set it explicitly to exactly 100 or 75.
  • Get more good bitmap fonts especially if available in a range of sizes and in roman, bold, italic and bold italic.
  • Enable the bytecode interpreter of FreeType2 if you can.
  • Get well hinted scalable fonts if you can and the bytecode interpreter has been enabled.
  • for X11 server fonts:
    • Ensure the type1 font module is not specified in /etc/X11/xorg.conf and the freetype one is.
    • Specify your fonts with either 0 as the DPI if they are outline or with the actual value of the screen's DPI if they are bitmap.
    • Use fonts.alias files to redefine existing bitmap font pixel sizes for different DPIs if your screen DPI is not very close to 100 or 75, for example like this one for 85DPI.
    • Prefer bitmap fonts, and then Type 1 fonts, unless the TrueType fonts you use are well hinted and the bytecode interpreter is enabled.
  • for Fontconfig:
    • Always set hinting to true.
    • If you have well hinted fonts and the bytecode interpreter enabled:
      • set autohint to false;
      • set antialias to false;
      • choose your default fonts to be either bitmap fonts or the well hinted fonts (for example Georgia, Verdana, Andale Mono).
    • If your fonts are not well hinted or the bytecode interpreter is disabled:
      • set autohint to true;
      • consider setting antialias to true;
      • set rgba to none (unless colour fringing does not bother you);
      • choose your default fonts to be bitmap fonts (for example Adobe Times, Helvetica, Courier), or else Type 1 fonts or DejaVu or Vera.
It is a bit sad that something so long and complex is just a simplified summary, as the situation is actually far more complex.
060507 A nice working IT8212F chipset card
Having recently ordered a parallel ATA host adapter card with a SiI680 chipset, I received instead one with an IT8212F. Well, I decided to keep it even though I had already tested another card based on the same chipset and found it unreliable.
The new card is not from Q-Tec but from NEWlink, another reseller of cheap things. Well, usually NEWlink stuff works somehow, and this version of the IT8212F reference design (the two boards have the same layout, even if the Q-Tec one is a bit more compressed onto a smaller board) actually works reliably, unlike the Q-Tec one.
But there is an interesting detail: top sequential transfer rates with it on my disks are decent but rather lower than with my other card based on the CMD0649 (that is, SiI0649 nowadays), as in 45MB/s instead of 60MB/s.
Perhaps the IT8212F chipset has been found to be flaky at higher transfer rates and the NEWlink card deliberately limits its speed to the reliable range; or, more likely, since this is a pseudo-SCSI host adapter it simply has higher latencies and lower top transfer rates.
060505b Using SSL with name based websites
It is usually stated that all sites using HTTP over SSL must be IP based and not name based, because of the compelling argument that the server needs to know which SSL key to use to decrypt the headers of the incoming HTTP request, but which key depends on the name of the site, which is itself contained in those headers.
Quite right, but this suggests that there are a couple of loopholes:
  • IP based sites can share the same IP address, as long as the port is different. This means that one can bind different sites using different keys to the same IP address, if using a non-standard port for all but the first (this is hinted at in the Apache documentation itself).
  • If all name based sites that share the same IP address (and port) have the same SSL key, the problem does not arise.
In the second case, the problem is then that as a rule SSL keys contain the name of the site inside themselves, so they can only be used for a single site. But this is not necessarily so, and the restriction exists only because certification authorities that sell keys to the public want to charge a fee for every individual name.
Technically it is possible to put multiple names in an SSL key, either by using a wildcard name (which is recognized by many but not all SSL enabled browsers and applications) or multiple DNS names in the subjectAltName field of the key, which can be done using self-signed keys, or by finding a flexible certification authority (if any exist).
So in the special cases where the name based sites all have names under a single prefix domain (wildcard subjectAltName), or are all known at key generation time, and it is acceptable for all of them to have the same SSL key, it is possible to have name based sites supporting SSL connections, by specifying that key for all of them, statically.
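For instance, a single self-signed certificate covering several names might be generated from an OpenSSL configuration fragment along these lines (all the names here are of course hypothetical):

```ini
# Hypothetical openssl.cnf fragment: one self-signed certificate whose
# subjectAltName covers several name based sites, plus a wildcard.
[ req ]
prompt             = no
distinguished_name = req_dn
x509_extensions    = v3_san

[ req_dn ]
CN = www.example.org

[ v3_san ]
subjectAltName = DNS:www.example.org, DNS:mail.example.org, DNS:*.example.org
```

Then something like openssl req -x509 -newkey rsa:2048 -nodes -days 365 -config san.cnf -keyout site.key -out site.crt would produce a key and certificate usable for all the listed sites, subject to the caveat above that not all SSL enabled applications recognize wildcards or multiple subjectAltName entries.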
060505 Two papers on costs of garbage collection and reference counts
There is an ancient debate over the costs of garbage collection and reference counting. The overall story is that:
  • The time cost of garbage collection is proportional to the number of live blocks at the time the collector is activated, and the space cost to that of dead blocks accumulated.
  • The time cost of reference counting is proportional to the number of accesses to live blocks and to the number of blocks freed, and the space cost to the total number of blocks.
In other words two very different profiles, which makes it hard to compare like-for-like. Even more intriguingly however there are a few other points of deeper detail that have great weight:
  • Garbage collection by letting dead blocks accumulate makes the ratio of address space to data space consumed greater than it should be.
  • Reference counting, by updating counts twice for every access to a block, greatly reduces the ratio of memory loads to stores.
Memory hierarchies have become ever deeper, but system designers have devoted those caching levels mostly to speeding up memory loads rather than stores, relying on applications being designed to do many more loads than stores, and reference counting runs directly counter to that.
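The store traffic is visible even in a minimal sketch of naive reference counting (in C, with hypothetical names): every time a block is acquired or released its count must be written back:

```c
/* Naive reference counting: every acquire and release is a store to the
   count, roughly doubling stores relative to plain pointer passing.
   All names are hypothetical. */
#include <stdlib.h>
#include <assert.h>

typedef struct Block { long refs; int payload; } Block;

Block *block_new(int v)
{
    Block *b = malloc(sizeof *b);
    b->refs = 1; b->payload = v;
    return b;
}

Block *block_ref(Block *b)   { b->refs++; return b; }          /* a store */
void   block_unref(Block *b) { if (--b->refs == 0) free(b); }  /* a store */
```

A tracing collector touches none of this on the fast path: handing a pointer around costs nothing, and the counting work is replaced by the periodic trace of live blocks.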
For this reason I reckon that for several years now reference counting has been a very bad idea, and I have seen papers demonstrating that it can take around 30-40% of the total running time of some applications (mostly because of the large number of stores it requires), making garbage collection a definite winner, especially if one considers the complications needed to ensure correct reference counts, which are especially troublesome with multiple threads.
There are some ways to lessen the impact of reference count updating, for example the ancient technique of putting reference count updates into an intention log to be condensed and run through once in a while. I recently re-examined that technique with perplexing results. Also, a somewhat recent paper by Boehm, one of the authors of the classic conservative garbage collector for C, shows that delayed reference counting may cost significant space overhead (text).
Then someone mentioned to me a recent paper comparing the cost of omniscient deallocation against garbage collection (text), where some tests show that garbage collection costs a lot more memory (or, trading that off, a lot more elapsed time, as paging increases). That quite perplexed me, because while garbage collection delays reclaiming freed memory, thus decreasing the density of live blocks (unreclaimed dead blocks are not available for allocation), that should not amount to an overhead of 400% as reported by the paper:
We find that GenMS, an Appel-style generational collector using a mark-sweep mature space, provides runtime performance that matches that provided by the best explicit memory manager when given five times as much memory, occasionally outperforming it by up to 9%.
With three times as much memory, its performance lags so that it performs an average of 17% slower. Garbage collection performance degrades further at smaller heap sizes, ultimately running an average of 70% slower.
Explicit memory management also exhibits better memory utilization and page-level locality, generally requiring half or fewer pages to run with the same number of page faults and running orders-of-magnitude faster when physical memory is scarce.
The comparison is between a garbage collector and a memory manager that just knows when to reclaim a block, without doing any liveness tracing (because that has been done in a prepass). Thus the difference in performance can depend upon two distinct causes: tracing, and delayed deallocation.
It would be interesting to know their separate impact, because unfortunately tracing is unavoidable, one way or another. Giving more memory to the garbage collector makes collections less frequent, thus reducing the average cost of tracing, but it also reduces the cost of delayed deallocation by the same token.
My understanding of the literature, which is mentioned in the article, is that tracing in a garbage collector has a cost lower than in most other designs, and that delayed deallocation actually is not that expensive.
Also, the comparison was done by running Java programs, and regrettably much Java code assumes that deallocation is essentially free. Programs designed to run with a garbage collector should try to minimize their garbage generation rates, but unfortunately that is a lost art, so I would not be surprised if the programs tested were a sort of worst case for a garbage collector.
Overall the main conclusions that I derive from the article are that tracing is a very expensive operation (though I have my doubts it must be as expensive as their measurements imply), that delayed deallocation costs need to be studied more, and that an application designed to run with a garbage collector should be implemented to minimize garbage generation; but the very availability of a garbage collector perhaps suggests to programmers that memory deallocation, being invisible, is free.
060504 Turning down fan RPMs for a quieter PC
A few days ago I installed some Zalman Fanmate2 fan speed controllers in my PC, to reduce the rotation speed of my CPU and case fans, as they seemed a bit too noisy and their speed was probably a bit excessive for the cooling needs of my PC.
Using lm_sensors to monitor fan speed and temperature, I have been able to tune the speed of the CPU fan from 2700RPM down to 2000RPM and that of the case fan from 2200RPM down to 1500RPM, at which they are nearly inaudible and still keep my system cool, with CPU temperatures between 35-40C. (I have a 90nm Athlon 64 3000+ which draws only 51W at full speed, so that's sort of easier than with most other CPUs.)
My PC's major noise sources now are the obnoxious if not very loud whine of the fan on my NVIDIA 6800LE card, which is small and whirrs a bit too fast (harder work to reduce that, as that involves replacing it), and the four hard drives, which by choice are among the quieter and less obnoxious ones, but four hard drives add up.
060502c Halo for PC and what is a CPU bound game
When my PC had just a 1.6GHz Athlon XP and a GF3 Ti200 video card, playing Halo for PC was not much fun: it was quite slow, having difficulty reaching 30FPS. Even upgrading the video card to a GF6800LE did not help much, but now that I have an Athlon 64 3000+, Halo for PC zips along, with frame rates consistently above 60FPS (sometimes by a lot) at 1280x1024 and all graphics goodies at the max.
This is more remarkable than it seems, and it belongs in the discussion of just how CPU bound current games are. Of course how CPU bound a game is depends on the game, but also on the CPU (and on how good or bad the code in the game is: as to this my perception is that PC game programmers in particular write the sloppiest code they can get away with and then hope that CPU speed improvements save them). My impression is that recently released PC games target systems with at least an Athlon 64 3000+ or a Pentium 4 3.2GHz (in other words systems in the £600-800 price region, which is sort of midrange today), and I suspect that is in part because that is the sort of PC on which games are developed.
This however sort of leaves behind all low cost systems and any PC older than a year or two. I wonder if disregarding such a large number of potential customers is wise. One of my favourite games of all time is Tribes 2, and when it was published it was awesomely advanced and could only be played, if at all, on top systems. That Dynamix, its developer, no longer exists is perhaps due also to the resulting sales problems. I would contrast that with the rather wiser attitude at Epic Games as to the engine used in the Unreal Tournament series of games, which had even software rendering for a long time, and would play well, with reduced glitz, on a lot less hardware than many other less successful games.
Now, going back to Halo for PC, the astonishing thing is that it is a port of Halo for Xbox, which performed decently well on an Xbox, whose hardware was no more powerful than my Athlon XP 2000+ with a GF3 Ti200. That it performed pretty poorly on that configuration has always made me suspect that it was deliberately pessimized to run decently only on PCs much more expensive than the Xbox, which of course tallies with releasing it on PC much later than on Xbox. Of course even Halo 2 has not been released on PC yet.
After all Halo and Halo 2 are signature games for the Xbox, and Microsoft want them to drive Xbox sales, not their own sales or PC sales, where Microsoft is already doing very well. As to this, it is interesting to note that Halo 2 will be deliberately crippled to run only on the upcoming MS Windows Vista, again in a transparent attempt to use it to drive sales of that platform.
060502b Updates to the ALSA sound notes and configuration
I have been looking again at my ALSA sound notes and in particular I have added a section on how to do common sound tasks like recording, and an entry on a mixer setup problem to the troubleshooting section.
I have also slightly revised my neat sample ALSA library configurations which I have finally managed to make parametric. Figuring out how to define a parametric PCM device stanza was not that hard, but getting the hang of how to instantiate it was much harder, because of the baffling way needed to do it: apparently the only way is as a string. Bizarre to say the least, but then ALSA seems often like that.
060502 Discoveries about Fontconfig configuration files
I have been revising and updating my sample configuration for Fontconfig in a few ways:
  • I have added to the aliases those for the GNU Free fonts both to the Fontconfig standard names and the Adobe 35 ones.
  • I have changed the defaults for auto-hinting and anti-aliasing to true as fonts that are not well hinted look too bad without, and set target="pattern" and binding="weak".
  • For fonts that are known to be well hinted I have reset the auto-hinting and anti-aliasing to false, assuming that one has the FreeType2 library with the manual hinting code enabled.
The latter task was made easier by a discovery: while in a match element the test elements are and'ed, one can obtain most of the effect of or'ing by specifying the attribute qual="any" on the test tag. This allows setting its contents to multiple values, for example as:
<match target="pattern">
  <test name="antialias" compare="eq"><bool>true</bool></test>
  <test qual="any" name="family">
    <!-- Microsoft web fonts -->
    <string>Andale Mono</string>
    <string>Arial Black</string>
    <string>Comic Sans MS</string>
    <string>Courier New</string>
    <string>Times New Roman</string>
    <string>Trebuchet MS</string>
  </test>
  <edit name="autohint"
    binding="weak" mode="assign_replace"><bool>false</bool></edit>
  <edit name="antialias"
    binding="weak" mode="assign_replace"><bool>false</bool></edit>
</match>
which avoids a lot of duplication (still necessary if one wants to or together tests on two different properties, rather than on two different values of the same property).
I still think that Fontconfig/Xft2 client side fonts are pretty bad news, even if they seem unstoppable now. Too bad, because I discovered another amusing problem: if one sets a font attribute in a configuration file, like anti-aliasing, it cannot be overridden in the font query; so for example if a configuration file sets anti-aliasing to false for Verdana, one cannot use the font query Verdana-10:antialias=true to reverse that.
Also, I question the wisdom of putting an XML parser in a runtime library, but then perhaps it is not so bad, even if the UNIX way is to have simple tabular configurations.
However, in the various experiments that I have done, the auto-hinter of FreeType2 performed pretty well on Type 1 fonts, which come out pretty well even without anti-aliasing.

April 2006

060430 Sample Fontconfig configuration files
I reviewed a bit the font setup on my GNU/Linux system, and I have rewritten (from the bundled originals) the Fontconfig configuration files in /etc/fonts/, and they are available as samples. Despite the warning I felt it was necessary to modify the fonts.conf file.
The major plus points of these configuration files are that they are fairly well commented and documented, and that they attempt to define standard well known font names, either specific (e.g. the Adobe 35 names) or generic (e.g. the Java font class names), and to assign actual font names to them in a flexible way.
I also found a not so little problem: configuration past some point seems to be ignored, perhaps because there is some limit on a buffer size or on the number of configuration entries.
060429 Freetype2 font rendering and quality
Well, I was trying to explain to someone the complicated tangle of messes that is the situation of fonts under GNU/Linux; one of them is the quality of rendering of fonts with FreeType2, especially scalable fonts with anti-aliasing.
This particular mess happens because at small raster sizes the positioning of each pixel in a glyph raster matters a great deal (there are few of them), and cannot follow idealized shapes closely.
This means that either the apparent resolution of the monitor has to be increased with anti-aliasing or the raster has to be manually adjusted (a process called hinting).
An additional complication is that the canonical way to hint TrueType fonts manually is patented in many countries, and a further one is that manual adjustment is (probably intentionally) a rather expensive process for TrueType (in Type 1 fonts the hinting is mostly automatic).
All this has created several complications:
  • Almost always fonts with some degree of openness have no or little hinting (for the same reasons why open source programs often have scant documentation: it takes time, it is boring and it brings less recognition than doing a cool prototype).
  • Several proprietary fonts contain very good hinting.
  • Open source rasterization libraries cannot use hinting support by default because in some countries it must be licensed.
  • Fully automatic hinting, which can be done, does not work as well as semi-automatic hinting, at least for TrueType, and it is hard to do well.
  • Anti-aliasing causes rasterized glyphs to look significantly different, and many people (including myself) find the results hard to focus their eyes on.
These complications have given rise to the following practical choices:
  • Bitmap fonts are well supported by FreeType2, the library that almost all font systems in the free software world use, and they are of course in effect fully manually hinted. Unfortunately very few people are making new bitmap fonts, and it is a significant effort too.
    One accidental advantage they have is that at least in the USA they cannot be copyrighted or patented as strictly as scalable fonts can, because the latter are in effect programs.
  • Open (to some degree) scalable fonts are usually poorly hinted and therefore they should only be used with anti-aliasing. Unfortunately many people object to that, so it is hard to recommend their use.
    The fully automatic hinting algorithms are not quite good enough for non antialiased rasterizing, but can improve a bit the look of anti-aliased rasterizing.
  • Fully hinted proprietary fonts look good with the manual hinter, but not very good with the fully automatic one.
    Some people enable the manual hinter code even in countries where it is patented. Some people even make illegal use of scalable proprietary fonts for which they don't have a proper license.
    However arguably one who has a PC with proprietary software installed can use the fonts bundled with that software under GNU/Linux too, because most proprietary fonts are licensed per host or per printer, not per OS.
Some people and distribution makers think that the obvious solution is to use open scalable fonts with auto-hinting and anti-aliasing. But while this mostly works, it does make fonts rather fuzzy.
Here are a few screenshots to illustrate the bleak alternatives:
  • This one and this one show text rasterized with well hinted proprietary fonts (Microsoft Verdana and Andale Mono) using the patented manual hinter, on the left drawn sharply in black and white without anti-aliasing, and on the right with anti-aliasing. The left side looks very good, the right side too, but rather fuzzier. It looks to me like there is no gain from anti-aliasing here.
  • This one shows a well hinted TrueType font (Microsoft Verdana) rasterized with the manual hinter on the left and with the automatic hinter on the right. The manual hinter rasterization is very much better.
My favourite solutions are in order of descending preference:
  • Use bitmap fonts.
  • Enable the manual hinter in FreeType2 and use well hinted proprietary TrueType fonts, in particular the Microsoft web fonts.
  • Use Type1 fonts, like the URW or Computer Modern ones, or semi-decent TrueType ones like DejaVu, with autohinting and no anti-aliasing.
  • Only in desperation, use non-hinted scalable TrueType fonts with the autohinter and anti-aliasing.
060424c When did the UNIX style start to fade?
While doing my large-filesystem speed tests I was dismayed as usual by how un-UNIXy the various fsck programs are, in particular the ReiserFS one, which by default not only prints lots of stuff, but also interacts with the user:
reiserfsck 3.6.19 (2003

** If you are using the latest reiserfsprogs and  it fails **
** please  email bug reports to, **
** providing  as  much  information  as  possible --  your **
** hardware,  kernel,  patches,  settings,  all reiserfsck **
** messages  (including version),  the reiserfsck logfile, **
** check  the  syslog file  for  any  related information. **
** If you would like advice on using this program, support **
** is available  for $25 at **

Will check consistency of the filesystem on /dev/hdi11
and will fix what can be fixed without --rebuild-tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you do):Yes
reiserfsck --fix-fixable started at Mon Apr 24 22:44:39 2006
Replaying journal..
Reiserfs journal '/dev/hdi11' in blocks [18..8211]: 0 transactions replayed
Checking internal tree..finished
Comparing bitmaps..finished
Checking Semantic tree:
No corruptions found
There are on the filesystem:
        Leaves 20104
        Internal nodes 128
        Directories 5551
        Other files 79784
        Data block pointers 17168285 (53185 of them are zero)
        Safe links 0
reiserfsck finished at Mon Apr 24 22:46:37 2006
This made me wonder when people started to ignore the principles of the UNIX pragmatic style, and, well, a good candidate is the very first fsck program:
File System: /

** Checking /dev/rxx0a
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Free List
236 files 1881 blocks xxxxx free 
and this is taken from the 2.9BSD documentation, that is around 1980. Pretty sad. The original fsck was from System III, which indeed contained a whole lot of non-UNIX practices (like using .d as a directory suffix in /etc).
060424b Summary of fsck times
Looking again at the more recent file system speed tests I have done it might be interesting to look specifically at the fsck times, and those previously recorded:
Desktop filesystem fsck times

File system     Small filesystem        Larger filesystem
                (6.7GiB, 420k inodes)   (65.3GiB, 85k inodes)
ext3 4KiB       4m31s                   3m25s
JFS 4KiB        2m14s                   35s
XFS 4KiB        n.a.                    33s
ReiserFS 4KiB   2m34s (?)               2m06s
It is quite clear that checking time is mostly proportional to the amount of metadata to be checked (and that metadata is mostly inodes), and secondarily to the space in the filesystem.
The table above is a bit misleading, not only because the two filesystems for which I have measurements are so different in both dimensions (inodes, average file size), but also because the larger filesystem was measured on a disc that is around 50% faster in both average access time and transfer rate than the other.
However, some impressions can be obtained; very approximately it seems that there is a time cost of 1-2 minutes per 200k inodes. If one extrapolates linearly from the larger filesystem, which has an 800KiB average file size, a filesystem with 6.5TiB of data in 8.5M inodes will take at least 50 minutes to check, and one with 65TiB of data in around 85M inodes at least 11 hours.
If one extrapolates from the smaller filesystem, which has a 16KiB average file size, those times must be at least 4 times larger.
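The linear extrapolation above can be made explicit with a little arithmetic (a rough sketch of my own; it just assumes checking time scales with inode count at the 1-2 minutes per 200k inodes estimated above):

```python
def fsck_minutes(inodes, minutes_per_200k=(1.0, 2.0)):
    # Linear model: checking time proportional to inode count,
    # at the 1-2 minutes per 200k inodes estimated above.
    units = inodes / 200_000
    low, high = minutes_per_200k
    return low * units, high * units

# 6.5TiB in 8.5M inodes (800KiB average file size): about 42-85 minutes
small = fsck_minutes(8_500_000)
# 65TiB in 85M inodes: about 425-850 minutes, that is roughly 7-14 hours
large = fsck_minutes(85_000_000)
```

which gives figures of the same order as the "at least" estimates above.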
Again, these are optimal times, assuming that the filesystem is freshly restored and there are no errors. One can imagine that recovery in a filesystem with damage and fragmentation would be a lot slower.
060424 Some larger filesystem informal speed tests
I was wondering how much my simple tests on file system speed from last year scale up to larger filesystems with larger files, and the recently upgraded CPU and memory.
By far the most complex filesystem I have is the root one, which contains a bit over 280,000 inodes and 7.5GiB in a 9.8GiB partition. But 7.5GiB is not a lot. So I chose my currently largest filesystem, which contains a bit over 80,000 files and 5,000 directories and 65.3GiB in an 83.8GiB partition. It contains mostly source and binary archives and media files, but also some trees of fairly small source files.
The numbers that I have gotten, for a freshly reloaded filesystem, are:
Desktop filesystem speed tests
(65.3GiB, 85k inodes)

File system     Free space  Restore  fsck    Repack  Find    Delete  Notes
ext3 4KiB       18.1GiB     53m06s   3m25s   26m06s  30s     3m36s   -m 0 -N 450000
JFS 4KiB        18.4GiB     52m07s   35s     28m05s  40s     2m19s
XFS 4KiB        18.4GiB     48m12s   33s     27m52s  30s     1m01s   xfs_check, xfs_repair
ReiserFS 4KiB   18.4GiB     57m57s   2m06s   31m50s  1m01s   1m59s   -o notail
My main impression from these numbers is that for larger average file size and/or a fast system there is less difference between the various file systems than for filesystems with lots of small files. I still like overall the profile of JFS, even if I wish it had post-creation bad block handling.
Some notes:
  • The measuring methods previously described have been used.
  • The elevator used was CFQ.
  • I use shorter than default pdflush parameters, and a filesystem read-ahead of 32.
  • This time I have omitted most of the system CPU times, as they were relatively insignificant.
  • I have also specified -m 0 when building the ext3 filesystem to remove the minimum reserved space and thus see the full size of free space.
  • The system has an Athlon 64 3000+ CPU, 1GiB of DDR-RAM, and 2x250GB plus 2x160GB discs on three separate ATA 100 channels. The tests have been done on an otherwise unused 250GB ST3250823A disc with a 60-40MiB/s transfer rate and a 9ms average seek time. The partition used was towards the inner part of the disc, thus more in the 45MiB/s transfer rate region.
  • The problem with restoring is that it involves both writing to the target filesystem, which is what we want to measure, and reading from the source. To minimize the impact of the latter I have put a tar image of the filesystem to be restored on another quiescent disc. I would have compressed it too, to minimize the read traffic, but most files in that filesystem are already compressed.
    A more accurate if less realistic approach would have been to produce a list of all file names and sizes in the filesystem and to create a script that creates them and fills them with zeroes to that size.
  • The times for fsck are those for a full check, not for a cursory one.
  • I haven't done a comparison of the time needed to reread the filesystem when used and when reloaded because that is an archival filesystem to which I usually add files, without deleting or overwriting existing ones.
  • It is very tedious indeed to run these tests involving dozens of gigabytes, as they take a long time and I can't do anything on my PC while they run.
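The file-and-size recreation approach mentioned in the notes could be sketched as follows (my own illustration, not a script actually used; the helper name and listing format are arbitrary choices):

```python
import os

def recreate_tree(listing, root):
    """Recreate a tree of files from (relative path, size) pairs,
    filling each file with zeroes to the recorded size. The listing
    could be produced beforehand on the original filesystem with
    e.g. GNU find: find . -type f -printf '%P\t%s\n'"""
    chunk = b"\0" * 65536
    for relpath, size in listing:
        path = os.path.join(root, relpath)
        os.makedirs(os.path.dirname(path) or root, exist_ok=True)
        with open(path, "wb") as f:
            remaining = size
            while remaining > 0:
                n = min(remaining, len(chunk))
                f.write(chunk[:n])  # write real zeroes, not a sparse hole
                remaining -= n
```

Writing actual zeroes rather than seeking past the end matters here, since a sparse file would not exercise the filesystem's block allocator.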
060423b Filesystem free space and fragmentation
I recently wrote elsewhere that:
The problem is indeed intrinsic: as the filesystem nears 100% usage the chances of finding contiguous or nearby free blocks when writing or extending a file becomes a lot smaller.
If the anti-fragmentation effects of a 5% reserve are dubious, it is only because 5% is way too low. It should be at least 10%, and ideally 20-30%.
On reflection this is almost right, but not quite, because while it is true that the greater the percentage of free blocks the greater the chances of finding contiguous free blocks, beyond a point that does not matter much.
The goal is not to reduce seeks to zero when reading a file, but to make them infrequent enough that sequential reading dominates. So for example on a disc with a 60MiB/s (that is, 60KiB/ms) sequential read rate and a 9ms average seek time, one might be happy enough wasting 10-20% of the time on seeks; since a 9ms seek costs as much time as 540KiB of sequential transfer, that requires contiguous block runs of roughly 2-5MiB.
So what matters is not to find a very long sequence of free blocks, but one long enough relative to the cost of moving the disc's arm, and that depends also on the absolute amount of free space, as perhaps more than the percentage.
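Under a simple seek-plus-transfer model (my own sketch, using the disc parameters quoted above) the run length needed for a given seek overhead works out as:

```python
def run_kib_for_overhead(target, seek_ms=9.0, rate_kib_per_ms=60.0):
    # Fraction of time spent seeking: seek / (seek + run/rate) == target,
    # which rearranges to run = rate * seek * (1 - target) / target.
    return rate_kib_per_ms * seek_ms * (1.0 - target) / target

run_kib_for_overhead(0.5)  # 540 KiB: as much time seeking as transferring
run_kib_for_overhead(0.1)  # about 4860 KiB for 10% of the time on seeks
```

By this model a 9ms seek costs as much time as 540KiB of sequential transfer, so keeping seeks down to a small fraction of the time requires runs several times longer than that.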
060423 Distasteful Microsoft bashing
It was a big disappointment to read a disgraceful Microsoft-bashing article which contains a number of points that I think are unfairly biased against Microsoft. There are some very good arguments in favour of Linux, and legitimate complaints about Microsoft and their products, and it is a disgrace when flawed ones are used.
I think that the most glaringly obnoxious part is:
He said, "When you are dealing with rootkits and some advanced spyware programs, the only solution is to rebuild from scratch. In some cases, there really is no way to recover without nuking the systems from orbit."
In other words, Windows users may have no choice but to wipe their systems down to the bare-metal and then reinstall the operating system and applications.
This is not uniquely an MS Windows affliction; GNU/Linux systems should also be reinstalled when a rootkit has been detected on them. A cleanup may conceivably be possible, if one knows the whole subject very well, but then it would also be possible with MS Windows (even if it would probably be easier on GNU/Linux). Still, the commonly given advice for both is to reinstall.
Another laughable point is about cool tricks in the new Vista GUI:
Thurrott has this to say: "Anyway, the reality of glass windows is that they stink ... But the visual difference between the topmost window (that is, the window with which you are currently interacting, or what we might describe as the window with focus) and any other windows (i.e. those windows that are visually located "under" the topmost window) is subtle at best. More to the point, you can't tell topmost windows from other windows at all. And don't pretend you can."
I don't know about you, but that's got me all excited about Aero.
Well, that's funny, because the two most exciting projects in Linux GUIs, XGL and AIGLX, are used to do amazing things like, ahem, glass windows (and even worse, ahem, wobbly windows).
I also find it quite distasteful to mention well meaning self criticism from the Microsoft camp and use it against them:
I quote: "Since the euphoria of PDC 2003 [Microsoft Professional Developers Conference 2003], Microsoft's handling of Windows Vista has been abysmal.
Promises have been made and dismissed, again and again. Features have come and gone. Heck, the entire project was literally restarted from scratch after it became obvious that the initial code base was a teetering, technological house of cards. Windows Vista, in other words, has been an utter disaster. And it's not even out yet."
And people thought I was hard on Vista!
There is more than one reason why mentioning self criticism as to poor project management and software engineering like this to attack Microsoft is distasteful:
  • That's exactly how many free software projects work. Shoddy code, false starts, inflated expectations... The number of stillborn or abandoned or awful projects is a clear illustration. Lots of parts of the Linux kernel itself, according to Linus Torvalds himself, are not that good or well maintained. The weakness of MS Windows is not that its project management or software engineering are uniquely poor, because that happens with free software too, a lot; it is that unlike with free software there is no alternative.
  • Using admissions of fault to attack projects fosters a climate not of discussion but of cultish propaganda, Bush-style. There are very many cases where free software advocates are disappointed with it and criticize their own side. Should they shut up to prevent their opponents from quoting them? Or isn't self criticism healthy, and Microsoft (or free software) supporters not drinking their own Kool-Aid a sign of their maturity?
As to technical merit, MS Windows is not much worse than GNU/Linux; to a first approximation they are broadly equivalent, doing much the same things in the same ways, with both suffering from much shoddiness and pointless glitz, and much the same security and performance issues. There are technical differences, but they are differences of degree; neither side has a large lead in any particular area.
Even if GNU/Linux has a core based on the UNIX architecture, which is rather more flexible and cleaner than the confused mixup of the MS Windows core APIs, that advantage is being lost thanks to the Microsoft contagion.
The defining difference is not that GNU/Linux can be cleaned of rootkits and MS Windows can't, or that MS Windows has useless cool GUI tricks and GNU/Linux does not, or even that one is expensive and the other is not: it is that GNU/Linux is free as in freedom.
The source is available and there are no restrictions to its modification and redistribution, and any member of the community can work on it and adapt it to their needs if they so desire. It is freedom, freedom from the well documented abuses that Microsoft indulges in thanks to the power and control that they have over MS Windows.
As Stallman says, even if free software were not as good technically as proprietary software, it would still be preferable because of the freedom.
060422b Disc-to-disc defragmenting and backups
To help someone who was having some issues with a SCSI tape drive I have reconnected my old DDS-2 4mm tape drive and run some tests with it, and it just reminded me of how incredibly slow it is.
I have been doing partition image disc-to-disc backups for the past four years, and that's a very different story; here is something from my nightly backup run (on a socket 754 Athlon 64 3000+ style PC) for two ATA 100 160GB drives:
'/dev/hdc1' to '/dev/hdj1': 321300+0 records in
321300+0 records out
10528358400 bytes (11 GB) copied, 203.717 seconds, 51.7 MB/s

real	3m23.789s
user	0m0.150s
sys	0m33.350s

'/dev/hdc3' to '/dev/hdj3': 192780+0 records in
192780+0 records out
6317015040 bytes (6.3 GB) copied, 113.145 seconds, 55.8 MB/s

real	1m53.206s
user	0m0.130s
sys	0m20.730s

'/dev/hdc6' to '/dev/hdj6': 1463420+1 records in
1463420+1 records out
47953350144 bytes (48 GB) copied, 1071.17 seconds, 44.8 MB/s

real	17m51.207s
user	0m0.760s
sys	3m49.660s

'/dev/hdc7' to '/dev/hdj7': 2745607+1 records in
2745607+1 records out
89968080384 bytes (90 GB) copied, 2489.91 seconds, 36.1 MB/s

real	41m29.980s
user	0m1.500s
sys	5m11.500s
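As a sanity check, the MB/s figures dd prints are just decimal megabytes over elapsed seconds; for the first copy above:

```python
def mb_per_s(nbytes, seconds):
    # dd reports decimal megabytes per second
    return nbytes / seconds / 1_000_000

rate = mb_per_s(10528358400, 203.717)  # about 51.7, matching dd's report
```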
Indeed the low cost of disc storage means that dangerous and slow things like in-place defragmentation of filesystems are no longer a good idea, at least for small to moderate filesystem sizes, the sort that one has on a PC or a SOHO server.
For defragmentation by far the best way is to image copy a partition to another disc, and then reformat the source partition and copy back to it the files one-by-one. This has some important advantages:
  • The first operation is a backup.
  • An image backup is going to run very fast, especially one between two discs, the target one will be quiescent, and usually on a different channel too (I get 40-50MB/s).
  • Even the copy back will be about as fast as possible, as it involves a mostly sequential scan of the source tree and a mostly sequential write of the target tree (I get at worst 10-20MB/s).
  • The newly formatted, newly filled filesystem will then be just about optimally laid out and freshly remade, without flaws or cruft.
A typical sequence of commands, simplified to the essentials, is like:
umount /dev/hdc1
dd bs=4k if=/dev/hdc1 of=/dev/hdj1
jfs_fsck /dev/hdj1
mount -o ro /dev/hdj1 /fs/hdj1
jfs_mkfs /dev/hdc1
mount -o rw /dev/hdc1 /fs/hdc1
: "One could use 'pax' or whatever else here"
(cd /fs/hdj1 && tar -cS -b8 --one -f - .) \
  | (cd /fs/hdc1 && tar -xS -b8 -p -f -)
umount /dev/hdj1
The root filesystem can be done this way too, either by using a liveCD, or by doing the image copy, rebooting with the root set to the copy, doing the tree copy, and then rebooting into the newly reconstructed filesystem.
Doing this should always be possible, as one should always have a disc used for image disc-to-disc backups. In general each disc inside a PC should be paired with another disc of the same capacity (but ideally of a different make), with a script running every night to copy the first to the second. This is currently by far both the cheapest and fastest way to do backups.
Of course one should also have some external way to backup, perhaps less frequently, and indeed I use external IEEE1394 disc enclosures for that. So every time I buy a hard disc I buy four (sometimes just three, living dangerously) of the same capacity (of different makes or at least different models, and from different shops), one to actually use, one for nightly backup to put into the PC box too, and two (or one) to put into external IEEE1394 boxes. This is still a lot cheaper than buying a tape unit and the equivalent amount of tapes.
One advantage of this arrangement is then that not only backup is amazingly fast and effortless, but if there is an issue with the primary disc, any of the backup discs can just be used in its place, as-is, that is essentially zero-time restore (I have done this several times).
Note that I do not do a RAID1 (mirror), but nightly images, and I keep the external boxes powered off most of the time.
There remains the problem of what to do for backup and defragmentation of very large filesystems that have constant availability requirements, and sizes of the order of dozens or hundreds of terabytes. Well, I don't really have any good answers for those, and I doubt that anybody does. It is even difficult to imagine what to do if they ever need fsck.
The only thing I can imagine is fully replicated storage servers and remote mirroring, and switching servers rather than discs.
060422 Some other filesystem speed tests
Just found some recently made filesystem speed tests for ext3, ReiserFS, JFS and XFS.
The tests seem to me to be largely meaningless (even if not quite as meaningless as some others) because as usual they are mostly cache tests, at least according to this note on the method used:
The sequence of 11 tasks (from creation of FS to umounting FS) was run as a Bash script which was completed three times (the average is reported).
as it is apparent that there is no unmounting between operations, so the cached blocks of one test stay cached for the next.
There are other obvious flaws in the article, for example the measurement of operations like mounting, which are quite instantaneous on such a small filesystem regardless of filesystem type.
Overall it is fairly useless, especially as my own notes include tests that are somewhat meaningful, better constructed and explained, and rather more comprehensive (among them: 050908, 050909, 050911, 050913, 050914, 050925, 051101, 051204, 051226b).
060419 Online non-MMORPG game usage statistics
Someone has reminded me of an interesting page of online game usage statistics which is quite useful because GameSpy is quite popular. There are surely biases in the numbers, because for example Tribes2 players usually don't use GameSpy to find online servers, but it gives some order-of-magnitude figures.
And the figures are that CounterStrike and other Half-Life engine based games are overwhelmingly (four times) more popular than the next one, Battlefield2, which is overwhelmingly more popular than the rest.
As to absolute numbers, the total for Half-Life engine based games (that is, almost only CounterStrike) plus Battlefield2 was around 150,000 players when I looked which is probably a small fraction of the number of MMORPGs players online at the same time.
Which all sort of supports my impression that the enormous success of MMORPGs is greatly reducing the number of people playing other games.
060418 Some speed/compatibility tests of USB2/FW chipsets
I have recently bought a new No Name combined USB2/FW PCI card, because my existing Adaptec one has a well-known issue with USB2: it uses a NEC chipset that apparently uses only one PCI bus cycle in two. Both cards have internal USB2 and FW connectors and a Berg (floppy style) power connector, so that they can act as powered hubs for devices needing more power than the PCI bus can supply.
So I decided to test them with my two combined USB2/FW boxes, to check compatibility and speed. The products involved are:
  • No Name USB2/FW PCI card, USB2 chip VIA VT6214L and FW chip VIA VT6306.
  • Adaptec AUA-3211 USB2/FW PCI card, USB2 chip NEC uPD720100A, FW chip TI TSB12LV26.
  • No Name ULT31311 external USB2/FW enclosure for 3.5" hard drives, USB2 and FW to ATA chip Prolific PL3057.
  • No Name (probably Triumph) TT-346U2F external USB2/FW enclosure for 5.25" and 3.5" disc and CD/DVD drives, USB2 to ATA chip ALi M5621 and FW to ATA chip Oxford Semiconductor OXFW911.
My tests were very simple and the results are:
Speed of various USB/FW combinations

Product   Protocol  Card chip    Box chip  Device    Sequential read speed
AUA-3211  USB2      uPD720100A   M5621     HD 250GB  not working
AUA-3211  USB2      uPD720100A   PL3057    HD 80GB   18.0MB/s
No Name   USB2      VT6214L      M5621     HD 250GB  2.6MB/s
No Name   USB2      VT6214L      PL3057    HD 80GB   33.4MB/s
AUA-3211  FW        TSB12LV26    OXFW911   HD 250GB  26.7MB/s
AUA-3211  FW        TSB12LV26    PL3057    HD 80GB   not working
No Name   FW        VT6306       OXFW911   HD 250GB  15.7MB/s
No Name   FW        VT6306       PL3057    HD 80GB   not working
Interesting and quite varied numbers. The VIA chipset is a lot faster in USB2 than the NEC one, which I expected, as is the Oxford Semiconductor one in FW. Quite unexpected is that the VIA chipset in USB2 is faster than the Oxford Semiconductor one in FW. Eventually I decided to keep the AUA-3211, as I trust FW a lot more than I trust USB2, even if the FW subsystem in Linux is not well maintained (but I have noticed some improvement in recent months, so probably someone is still paying attention).
Also I could not get the FW side of the ULT31311 box (PL3057) to work reliably or at all (a Google search shows that this is a common issue with the PL3057), and the USB2 side (M5621) of the TT-346U2F box either did not work or was excessively slow with hard drives. But when I tried it with a DVD drive it seemed OK.
The state of USB2 and FW PCI and mass storage bridge chips is not good, and it has baffled even Jamie Zawinski. In part this is because there are too many semi-incompatible, shoddily designed, half-working chips, and in part because the relevant Linux drivers could be described similarly. It does not help that in many cases there are also power supply issues.
Overall it is another instance of a caricature of Gresham's Law, one where bad software and hardware designs crowd out good ones, because:
  • Bad designs are marginally cheaper.
  • Bad designs are slightly quicker to do.
  • Most customers cannot assess quality.
  • Most managers don't care about quality.
  • Bad designs are harder to understand and maintain, and thus result in better job security, especially if they are kept undocumented.
060416b Solid, HAL, D-BUS, sysfs
Dominated by the merciless Microsoft cultural hegemony GNU/Linux developers continue on a rampage of overcomplicated software architectures, trying to emulate the fragile complexity for which Microsoft's software products are justly famous, as reported by Linux Format in an article about Solid:
On a platform such as Linux Solid uses HAL to talk to devices, but on other systems such as Solaris, BSD, Mac OS X, and Windows, Solid will use a HAL equivalent for the specific system.
Because HAL relies on Linux-specific functionality in the kernel, it is restricted to Linux. The kernel provides hardware information with Sysfs, which is in turn used by udev, which is in turn used by D-BUS, which is in turn used by HAL.
Now I like modular software, but this is getting a bit too modular; and things like sysfs and udev already look like masses of poorly designed and documented hacks to me.
060416 Retesting JFS performance over time
After updating to Fedora 5, which caused a rather extensive complete rewrite of my PC's root filesystem (JFS, format, 9.8GiB capacity, 7.6GiB used, on a 250GB/233GiB drive capable of around 57MiB/s), I have done again my usual JFS speed tests (complete repacking with tar of the filesystem) with these results:
Used vs. new JFS filesystem test

File system   Repack elapsed  system  Transfer rate
used JFS      13m03s          30s     9.0MiB/s
new JFS       04m53s          27s     24.0MiB/s
which gives a slowdown of around 2.7, which is not too bad for JFS compared with the seven times for ext3, and is not much worse than the 2.1 times slowdown previously measured after upgrading just some parts like KDE.
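For the record, the 2.7 figure is just the ratio of the two elapsed times:

```python
def seconds(minutes, secs):
    return minutes * 60 + secs

# used JFS repack elapsed 13m03s vs new JFS 04m53s
slowdown = seconds(13, 3) / seconds(4, 53)  # 783 / 293, about 2.7
```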
For the past few months, in order to measure the decay of performance as files were rewritten, I have not actually written back the re-laid-out filesystem, but I shall be doing so now and periodically, as that is the best way to defragment.
Defragmentation in place is both very slow and dangerous; backing up to another disc drive and image copying the backup is both faster and safer. As to the additional space requirement, that's needed anyhow for backup, and of course one has to do backups.