Software and hardware annotations

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

September 2005

050929

As to one of my favourite topics, poor memory (and IO) policies in Linux (and elsewhere), I found via this OSNews discussion this article on OpenOffice.org startup times, which resonates quite a bit with me.
Some misguided people say that KWord starts a lot faster; this is a pointless comparison, because KWord starts after all of KDE has started, and finds a very large amount of the things it needs preloaded (and not just because of the kdeinit technique that the article author mentions).
As the article author says, OpenOffice.org starts slowly in part because of rubbish policies in the kernel, as well as the usual issues with shared library fixups and so much else. For example, if advising were ever to be properly implemented... But hey, most top developers have plenty of RAM and large PCs, so they don't have that itch.
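For readers who have not met the term, advising refers to interfaces such as posix_fadvise, with which a program tells the kernel how it intends to access a file. The following is only a minimal sketch in Python; the helper name read_sequentially and the chunk size are illustrative assumptions, and whether the kernel makes good use of the hint is a separate matter.

```python
import os


def read_sequentially(path, chunk_size=1 << 16):
    """Read a whole file after hinting that access will be sequential.

    Returns the number of bytes read. The helper name and chunk size are
    illustrative assumptions, not taken from any article.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # posix_fadvise is not available on every platform, so guard it.
        if hasattr(os, "posix_fadvise"):
            # Offset 0 with length 0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        total = 0
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        return total
    finally:
        os.close(fd)
```

The advice is purely a hint: the program behaves the same with or without it, only the caching and readahead decisions may differ.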

050925

While explaining my recent file system tests I realized that I need to explain in some detail two aspects of those tests that do introduce some special type of bias.

By unmounting the filesystem and then remounting it before each operation the filesystem blocks are eliminated from the buffer cache. This greatly improves reproducibility, as it ensures that each test begins with no blocks already in memory. However, part of a filesystem design is how well it caches things, in particular metadata, so starting with no cached blocks is both less realistic and favours filesystems that don't cache that well.
In the tests, for example the find one, the time taken is reported with time find... which does not fully report the time taken by the test; for example with find the atime of all inodes is updated, and this means that when find ends, there will be many updated inodes in the block cache yet to be written to the disk. Similarly for data blocks in time tar x.... The problem here is that it is pretty difficult to account reproducibly for the extra time to fully complete the operation, and that not accounting for it helps file system designs that cache more aggressively.

Overall I reckon however that given the relatively small size of my block cache and the relatively large size of the test data, and from my observations, the overall effect is small, and in any case it is somewhat fair to ignore it, as in ordinary use there is even more noise.
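One way to at least see the size of the second bias is to time the operation and then separately time a flush of the dirty blocks it left behind. This is a rough sketch, not what the tests above did; the function name timed_with_flush is hypothetical, and since os.sync flushes every filesystem the flush figure is only meaningful on an otherwise quiescent machine.

```python
import os
import time


def timed_with_flush(operation, flush=getattr(os, "sync", lambda: None)):
    """Run operation(), then flush dirty blocks, timing each part.

    Returns (operation_seconds, flush_seconds). The default flush is
    os.sync where it exists; it writes out dirty data for all mounted
    filesystems, not just the one under test.
    """
    start = time.monotonic()
    operation()
    done = time.monotonic()
    flush()
    flushed = time.monotonic()
    return done - start, flushed - done
```

A large flush time relative to the operation time would suggest that the plain time figure flatters a file system that caches writes aggressively.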

050923

Argh.... Having decided to have a look at how /proc/sys/vm/page-cluster really behaves I have had a look in the Linux kernel sources and found these astonishing bits of code:

  	/* Use a smaller cluster for small-memory machines */
	if (megs < 16)
		page_cluster = 2;
	else
		page_cluster = 3;

int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
{
	int ret = 0, i = 1 << page_cluster;
	unsigned long toff;
	struct swap_info_struct *swapdev = swp_type(entry) + swap_info;

	if (!page_cluster)	/* no readahead */
		return 0;
	toff = (swp_offset(entry) >> page_cluster) << page_cluster;
	if (!toff)		/* first page is swap header */
		toff++, i--;
	*offset = toff;

	swap_device_lock(swapdev);
	do {
		/* Don't read-ahead past the end of the swap area */
		if (toff >= swapdev->max)
			break;
		/* Don't read in free or bad pages */
		if (!swapdev->swap_map[toff])
			break;
		if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
			break;
		toff++;
		ret++;
	} while (--i);
	swap_device_unlock(swapdev);
	return ret;
}

and both make me feel sick and depressed (why is left as an exercise to the reader :-)).
Conclusion: it looks ever more important to set /proc/sys/vm/page-cluster to 0 instead of the usual default of 3.
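To make the arithmetic in valid_swaphandles easier to follow, here is a small Python model of the same loop. It is only a sketch: a plain list stands in for swapdev->swap_map, the locking is omitted, and the value given to SWAP_MAP_BAD is a placeholder assumption rather than the kernel's definition.

```python
SWAP_MAP_BAD = 0x8000  # placeholder assumption, not taken from the kernel headers


def readahead_window(offset, swap_map, page_cluster):
    """Model of valid_swaphandles: return (start_offset, pages_to_read).

    swap_map is a list of use counts indexed by swap offset; index 0 is
    the swap header. Returns (None, 0) when readahead is disabled.
    """
    if page_cluster == 0:
        return None, 0
    remaining = 1 << page_cluster
    # Round the faulting offset down to a multiple of the cluster size.
    start = (offset >> page_cluster) << page_cluster
    if start == 0:
        # The first page is the swap header, so skip it.
        start, remaining = 1, remaining - 1
    toff = start
    count = 0
    while remaining > 0:
        if toff >= len(swap_map):
            break  # past the end of the swap area
        if swap_map[toff] == 0 or swap_map[toff] == SWAP_MAP_BAD:
            break  # free or bad page
        toff += 1
        count += 1
        remaining -= 1
    return start, count
```

With page_cluster set to 3 and every slot in use, a fault on swap offset 13 gives a window of 8 pages starting at offset 8; with page_cluster set to 0 there is no readahead at all.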

050921

Having decided to transform my changelog for this site into a proper syndication feed, I had to decide which format and how. I was swayed by Dave Winer's arguments for RSS 2.0 mostly because it is quite simple and backwards compatible with RSS 0.91 which was and still is so popular; I do have sympathy for the arguments in favour of the political correctness of RSS 1.0 which is based on RDF, but sometimes I get impressed by expediency too.
So I set out to find some DTD for both version 0.91 and 2.0 of RSS, for use with PSGML mode of [X]Emacs or perhaps another DTD validating editor like jEdit. I then created some suitable SGML CATALOG and also XML CATALOG, and template headers for RSS files:

<?xml version="1.0"?>
<!DOCTYPE rss
 PUBLIC "-//IDN Netscape.com/DTD RSS 0.91//EN"
 "http://My.Netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
</rss>

<?xml version="1.0"?>
<!DOCTYPE rss
 PUBLIC "-//IDN Silmaril.IE/DTD RSS 2.0//EN"
 "http://WWW.Silmaril.IE/software/rss2.dtd">
<rss version="2.0">
</rss>

The RSS 0.91 DTD seems mostly fine, but the RSS 2.0 DTD is not quite right, as it is based on the idea that "A channel can apparently either have one or more items, or just a title, link, and description of its own", which is not quite correct, as this authentic sample attests.
It also imposes an order on the title, description and link subelements of item that is also quite wrong.
The RSS 0.91 DTD is conversely extremely lax as to ordering of subelements, so I fixed the RSS 2.0 DTD in an intermediate way, replacing the definitions for channel and item as follows:

<!ELEMENT channel
  ((title|link|description)+,
    (language|copyright
    |managingEditor|webMaster|pubDate|lastBuildDate
    |category|generator|docs|cloud|ttl|image
    |textInput|skipHours|skipDays)*,
   item+)>

<!ELEMENT item
  ((title|link|description)+,
   (author|category|comments|enclosure|guid|pubDate|source)*)>

The customary order is title, link, description, but the definitions above leave that unenforced as long as they precede all other subelements.
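The effect of the two content models above can be checked outside a validating editor with a few lines of Python. This is only a sketch of the same rule (one or more of title, link, description first, then the optional elements, then for a channel one or more items); the function names are hypothetical and nothing here reads the DTD itself.

```python
import xml.etree.ElementTree as ET

HEAD = {"title", "link", "description"}
CHANNEL_OPTIONAL = {
    "language", "copyright", "managingEditor", "webMaster", "pubDate",
    "lastBuildDate", "category", "generator", "docs", "cloud", "ttl",
    "image", "textInput", "skipHours", "skipDays",
}
ITEM_OPTIONAL = {
    "author", "category", "comments", "enclosure", "guid", "pubDate", "source",
}


def matches_model(tags, optional, tail=None):
    """Check a list of child tag names against (HEAD)+, (optional)*, (tail)+."""
    i = 0
    while i < len(tags) and tags[i] in HEAD:
        i += 1
    if i == 0:
        return False  # at least one of title, link, description must lead
    while i < len(tags) and tags[i] in optional:
        i += 1
    if tail is None:
        return i == len(tags)
    first_tail = i
    while i < len(tags) and tags[i] == tail:
        i += 1
    return i > first_tail and i == len(tags)


def channel_is_valid(channel):
    """Apply the channel model, then the item model to each item child."""
    tags = [child.tag for child in channel]
    if not matches_model(tags, CHANNEL_OPTIONAL, tail="item"):
        return False
    return all(
        matches_model([child.tag for child in item], ITEM_OPTIONAL)
        for item in channel.findall("item")
    )
```

For a parsed feed the channel element would be obtained with something like ET.parse(path).getroot().find("channel").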
For extra value I have cleaned up my subtle RSS style in CSS and its accompanying RSS JavaScript helper.
The RSS style allows clean rendering of (most of) an RSS feed directly in a browser, if the browser supports CSS styling of XML; Mozilla and Firefox, Konqueror 3 and Opera 8 do this pretty well. This can be achieved by inserting this processing instruction around the top of the RSS file:

<?xml-stylesheet type="text/css"
  href="style/rss.css"?>

The RSS JavaScript helper turns the link elements into active, clickable links (only in Mozilla and Firefox), and to enable such transformations this element should be added as the last one in the body of the RSS file:

<script xmlns="http://www.w3.org/1999/xhtml" type="text/javascript"
    src="style/rss.js"></script>

050920

I got another complete machine lockup while doing a large partition copy with dd and heavily using a JFS partition at the same time. This did not use to happen before I switched to JFS, even if in normal interactive use by itself JFS seems now very reliable.
After looking at the malloc() tunables I also had a look at the kernel tunables for memory allocation and swapping. I set /proc/sys/vm/swappiness long ago to be way lower than the default, to 40, as the buffer cache does not work very well, because it uses LIFO policies when most accesses are FIFO, and tragically file (and memory) access pattern advising is not implemented. The result is good.
While reviewing the others I noticed that I should make sure that /proc/sys/vm/page-cluster is set to 0 because prefetching and/or large pages (even worse) are a very bad idea. Well, the bad news and the good news are:

After an extended period of really heavy web browsing it turns out that JFS by itself does not quite bring much improved responsiveness beyond the effect of having a freshly loaded filesystem.
Setting the page clustering parameters seems instead to result in a large improvement in responsiveness when memory is tight and there is quite a bit of memory swapped out.
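The two tunables mentioned in this entry are plain text files holding one integer each, so they can be inspected and set with very little code. A minimal sketch follows; the function names are hypothetical, the root parameter exists only so the helpers can be tried against an ordinary directory, and writing under /proc/sys/vm requires superuser privileges.

```python
import os


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of a single-valued VM tunable."""
    with open(os.path.join(root, name)) as f:
        return int(f.read().split()[0])


def write_vm_tunable(name, value, root="/proc/sys/vm"):
    """Set a single-valued VM tunable (needs root for the real /proc)."""
    with open(os.path.join(root, name), "w") as f:
        f.write(f"{value}\n")
```

The settings described above would then be write_vm_tunable("swappiness", 40) and write_vm_tunable("page-cluster", 0); they do not survive a reboot unless reapplied at startup.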

050919

Sometimes my caution concerning newly released stuff is overcome by other considerations, and I set about switching piecemeal my Debian install from being mostly Sarge edition to allowing Etch (that is currently Sid) which involves some wrenching ABI transitions. One of these transitions is from GNU LIBC 2.3.2 to 2.3.5 and this causes trouble.
The reason is that version 2.3.5 does more stringent malloc checks, and therefore some applications with previously undetected bugs now crash, and this happens right at the time of the installation of updated packages.
The more stringent checks can be disabled by setting the environment variable MALLOC_CHECK_ to 0 (or 1), which of course is a bit sad.
However I decided to have a look into GNU LIBC to see if there are other interesting allocator-tweaking environment variables, and indeed there are several, and I was lucky to find a page that lists them and several others in various parts of GNU LIBC. The ones relevant for allocation are:

MALLOC_TOP_PAD_: if the heap has to be grown at its end, add this much to the allocation in bytes.
MALLOC_TRIM_THRESHOLD_: if the heap has more than this many free bytes at its end, shrink it.
MALLOC_MMAP_MAX_: maximum number of blocks to allocate via mmap.
MALLOC_MMAP_THRESHOLD_: blocks of this size or larger (in bytes) are allocated via mmap.
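Since these are ordinary environment variables they can be set for a single program without touching the rest of the session. A small sketch in Python, in which the wrapper name run_with_malloc_tunables is hypothetical and any values passed are just examples rather than recommendations:

```python
import os
import subprocess
import sys


def run_with_malloc_tunables(argv, **tunables):
    """Run argv with extra allocator environment variables set.

    Keyword names are used verbatim as variable names, for example
    MALLOC_CHECK_=0 or MALLOC_MMAP_THRESHOLD_=1 << 20.
    """
    env = dict(os.environ)
    for name, value in tunables.items():
        env[name] = str(value)
    return subprocess.run(argv, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Example values only: show that the child process sees the variable.
    result = run_with_malloc_tunables(
        [sys.executable, "-c", "import os; print(os.environ['MALLOC_CHECK_'])"],
        MALLOC_CHECK_=0,
    )
    print(result.stdout.strip())
```

Whether a given C library version honours each variable depends on that version, so it is worth checking its documentation before relying on any of them.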

I also found some discussion and numbers about various aspects of GNU LIBC memory allocation in a nice if a bit old article.
I also found a report of high system overheads due to mmap allocations, which reveals that handling larger allocations via mmap can be very expensive if these allocations are short lived. Similarly for reducing the heap size when its top can be freed.

050918

Today a happy moment for further ALSA understanding: with a bit of deductive reasoning I figured a way around some ALSA library plugin restrictions to achieve sw mixed, 2 to 4 channel, playback. The key was to (re)realize that for my CMI8738 card device 1 on the card is the multichannel one, but only in four channel mode. This allows the dmix plugin to use it directly.
I have also had another look at the poorly documented syntax for parametric ALSA library configuration entries so perhaps I will be able to replace the duplicate definition of my configuration for cards 0 and 1 with a single parametric one for most configurations.
On a completely different topic, a very wise note by my hero Ulrich Drepper on LSB and its technical and social aspects. The author is becoming suitably cynical while keeping his scruples, too bad for his career.
And some interesting comment on the endless cycle of software reinvention for which I have another example: I often look at the LKML archive using an RSS feed, even if it is after all a mailing list, because it is a much more efficient way to fetch and read just the things that interest me, in other words to use the mailing list archive as a newsgroup.

050917

It turns out that most likely the -o noatime crashes, which recurred, are due to non-JFS issues. Also, this JFS quick patch seems to have removed one cause of trouble.
I have converted all my ext3 filesystems to JFS to see how JFS performance degrades with usage, after the sevenfold slowdown over time shock of ext3. It is obvious that virtually all file system tests and benchmarks happen on a freshly loaded filesystem, so there is very little incentive for file system authors to reduce performance degradation over time; all that matters is performance on a fresh load.
Overall KDE and X under JFS seem more responsive (in particular program startup) than under ext3; but this may be because it is a freshly loaded filesystem, or because of the switch from 4KiB to 1KiB blocks, rather than because of better latency or performance for JFS as such.
So I have created a copy of my JFS root filesystem as a 4KiB freshly loaded ext3 filesystem, and I have used that for a day. It feels better than the well used ext3 1KiB filesystem I had originally, but my impression is that it is not quite as good as JFS, perhaps because some operations that take time involve directory scanning, and I have not enabled indexed directories under ext3, while of course all non trivially small JFS directories are indexed.
As to the surprise that under JFS Konqueror does not grow that crazily, but actually occasionally shrinks, which I speculated was mmap related: well, a freshly loaded ext3 with 4KiB blocks seems to behave like JFS. It can be that I am seeing things that are not there, or that the real issue is to have filesystem block size equal to page size, in which case perhaps the Linux kernel does some special mmap optimization.

050916 (updated 051011)

Quite a bit more testing of file systems, more later, and new surprises. Some performance profiles are highly anisotropic and nonlinear... The latest is that in order to get the full performance of my hard drives reported by hdparm -t I have to set them up with a soft readahead of 32 sectors or more; 16 or less causes a huge falloff in the reported speed, for example for my /dev/hda (a WD 80GB 7200 unit) from 40MiB/s to 13MiB/s, one third (with a readahead set to 24 sectors it gets to 24MiB/s). Now that looks related to back-to-back transfers, probably because of the firmware in the unit, as the others are slower with a smaller readahead but not as much.
Huge readahead is of course going to help with sequential streaming accesses, but can lead to prefetching of stuff that is not needed otherwise.
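To put the readahead settings in more familiar units, the arithmetic is sketched below. The three throughput figures are the ones quoted above for /dev/hda; the 512 byte sector size is an assumption, though it is the usual unit for these settings.

```python
SECTOR_BYTES = 512  # assumption: readahead is counted in 512 byte sectors

# Readahead in sectors -> hdparm -t throughput in MiB/s, as quoted above.
measurements = {32: 40, 24: 24, 16: 13}


def readahead_kib(sectors):
    """Convert a readahead expressed in sectors to KiB."""
    return sectors * SECTOR_BYTES / 1024


def fraction_of_peak(sectors):
    """Throughput at this readahead as a fraction of the best measured."""
    return measurements[sectors] / max(measurements.values())


if __name__ == "__main__":
    for sectors in sorted(measurements):
        print(sectors, readahead_kib(sectors), round(fraction_of_peak(sectors), 2))
```

So the cliff lies between 8KiB and 16KiB of readahead, which are small amounts compared with the buffer that a drive of this kind carries.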
I have also noticed in the recent OLS 2005 paper on ext3 evolution that as of Linux 2.6.10 ext3 locking is rather less coarse than before, which should help a lot with scalability to highly parallel benchmarks.
I also appreciated and liked the emphasis the paper gives to stability and backwards compatibility, as well as recoverability, as goals for ext3. From the paper I learned that the indexed directories in recent ext3 versions use indexes carefully designed to be on top of an unmodified directory data format, so even if the index is corrupted the directory is still readable.
The paper reported also the valiant attempts by some to add more modern features, like extents, most of which break backward compatibility, but I dislike them. If one wants extents there are already extent based file systems out there that are quite good. The goal of ext3 should be to be itself, not to mutate into something else, as its original author, Stephen Tweedie, stated some time ago:

So it provides scaling of the disk filesystem size and it allows you to make larger and larger filesystems without the fsck penalty, but it's not designed to add things like very, very fast, scalable directories (to Linux) to EXT2.
It's not designed to provide extremely efficient support for extremely large files and extent-mapped files and this type of thing that people have been talking about for EXT2. But really the goal is to provide that one piece of functionality alone.
And one of the most important goals for the whole project was to provide absolute, total, complete backwards and forwards compatibility between EXT2 and EXT3.

But then it may be a case of job protection: if somebody's job title is ext3 developer it may be hard to talk oneself out of a salary by saying that things are fine as they are and they don't need to be developed further; the same logic as the constant innovation in marketing or pricing plans: if your job is marketing manager or pricing manager, it may be counterproductive to tell your boss that the current marketing campaign or price plan are just fine.

050915

More gripping discoveries with filesystems... Not unexpectedly my newly loaded JFS based file systems have resulted in a dramatic improvement in GUI responsiveness as widely scattered blocks in widely scattered files have been replaced presumably by contiguous blocks in nearby files.
But rather astonishingly my KDE apps seem no longer as memory mad and in particular they no longer grow crazily with time (so far). Their total allocated memory also shrinks occasionally, and using pmap I have checked that the anonymous mappings do indeed shrink.
I suspect this is not at all due to the switch to JFS as such, but to JFS supporting only 4KiB blocks, while my previous file system was ext3 with 1KiB blocks. I had often wondered just how on a CPU with 4KiB pages was mmap dealing with files broken in 1KiB blocks, and now I guess that:

mmap does not deal very well with non default block sizes that are smaller than the size of a page.
If mmap sometimes cannot deal at all with file blocks being smaller than pages then file IO will fall back to read and write via the buffer cache. Perhaps it turns out that this is not well tuned because most people use page-sized file system blocks.
There is something demented in either the VM subsystem or the memory allocator in GNU LIBC that handles very poorly the case where file system block size is smaller than the CPU page size. Recent versions of the allocator are supposed to turn large allocations into their own memory segments, and indeed pmap shows quite a few 4KiB, 8KiB and 12KiB anonymous mappings in existence for Konqueror (but there is a single 59KiB anonymous mapping, presumably the main allocator arena).
For executables, which are presumably mmaped into memory on exec, the alternative is to read them into memory where they get reblocked into 4KiB pages that then get swapped out. The dreadful suspicion is:
- copy-on-write applies to mmap'ed executables, which means all processes running the same executable share the same pages (minus the copy-on-write ones), whether or not they descend from a common ancestor;
- for executables that have been read on exec that creates a distinct copy.
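The anonymous mappings that pmap shows come from /proc/PID/maps, so their sizes can also be collected with a short script and compared over time. This is a sketch only: the function name is hypothetical, and it treats as anonymous any mapping line with no sixth field, which deliberately leaves out labelled regions such as [heap] and [stack].

```python
def anonymous_mapping_sizes(maps_text):
    """Return the sizes in bytes of unnamed mappings in /proc/PID/maps text.

    Each line has the fields: address range, permissions, offset, device,
    inode and, for file backed or labelled regions, a sixth name field.
    """
    sizes = []
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # named mapping, labelled region, or malformed line
        start, end = (int(part, 16) for part in fields[0].split("-"))
        sizes.append(end - start)
    return sizes


if __name__ == "__main__":
    # Inspect this very process as a usage example.
    with open("/proc/self/maps") as f:
        sizes = anonymous_mapping_sizes(f.read())
    print(len(sizes), "anonymous mappings,", sum(sizes) // 1024, "KiB in total")
```

Running it periodically against the PID of a long lived process would show whether the anonymous mappings grow, shrink or just multiply.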

To see deeper into this startling behaviour I am duplicating my root file system into a newly loaded ext3 file system, first with 1KiB blocks and then with 4KiB blocks, to be doubly sure.
Well, first surprise: my previous notes that overall JFS saved memory with respect to ext3 with 1KiB may not be totally reliable, because my new JFS root and its fresh ext3 1KiB give:

Space under `ext3` and JFS
File system	JFS	`ext3` 1KiB	`ext3` 4KiB
hdc1 (8032MiB)	7220 Used 781 Available	6495 Used 1061 Available	7230 Used 327 Available

This may simply be because the root file system (which in my case includes /var and thus the Squid cache, as well as the library headers etc.) contains a very large percentage of small files, and thus the loss due to the larger block size is greater than the gain on metadata; this seems reasonable as ext3 with 4KiB blocks has about the same space used as JFS which also has 4KiB blocks, but a lot less space available.
By the way, in the reading back of this optimally laid out fresh file system, ext3 with 4KiB blocks was 40-50% faster than JFS, which is slightly puzzling.

050914

As I want to experience what happens to the performance of well used JFS file systems, I have converted my ext3 ones to JFS; and I got my first metadata corruption (in the dtree of a directory) when unpacking a file from a FAT32 file system into a newly formatted JFS one, and this after the crash with noatime a while ago. It may be it is not the JFS code after all: it could be some dodgy code somewhere else overwriting things where it should not, after all I am using a bleeding edge 2.6.13 kernel.
It might be some hardware problem, so just to err on the safe side I have recently run memtest86 overnight and no problems were reported. I am also really sure CPU etc. temperatures are low, and I have a superstable 550W power supply which is wildly overspecified for my box.
However, wonders never cease, and surprising news: despite the move from 1KiB blocks to 4KiB blocks the amount of free space has grown substantially, and at the same time the amount of space used for data has also grown, as reported by df -m:

Space under `ext3` and JFS
File system	`ext3`	JFS
hdc6 (4016MiB)	3332 Used 441 Available	3381 Used 620 Available
hdc7 (24097MiB)	20944 Used 1839 Available	21164 Used 2901 Available
hdc8 (9028MiB)	7997 Used 558 Available	8098 Used 900 Available

This miraculous situation is probably because while JFS has large blocks (which explains in part why the amount of space used has grown) it also probably has, especially in these conditions, a much smaller metadata overhead because:

JFS only allocates space for most metadata as needed, while ext allocates most statically and usually one makes sure it is overallocated.
For the metadata that ext3 allocates dynamically, the indirect blocks in the file space tree, JFS uses extents instead.
While the number of indirect blocks is solely a function of the number of blocks in a file, the number of extents in a file under JFS depends also on how contiguous is the free space area.
Since the JFS numbers above are for a load into an empty file system, in which the free space area is entirely contiguous, it is likely that most if not all files will be described by a single extent.
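A crude way to see the difference in the table above is to subtract used and available space from the partition size. The sketch below does just that with the df -m figures quoted; it assumes the size in parentheses is the partition size, and the remainder lumps together metadata and any blocks reserved for the superuser, so it is only a rough upper bound on metadata overhead.

```python
# MiB figures from the table above: size, then (used, available) per file system.
partitions = {
    "hdc6": (4016, {"ext3": (3332, 441), "JFS": (3381, 620)}),
    "hdc7": (24097, {"ext3": (20944, 1839), "JFS": (21164, 2901)}),
    "hdc8": (9028, {"ext3": (7997, 558), "JFS": (8098, 900)}),
}


def unaccounted(size, used, available):
    """Space that is neither used by data nor available, in MiB."""
    return size - used - available


def unaccounted_table():
    """Return {partition: {file system: unaccounted MiB}}."""
    return {
        name: {fs: unaccounted(size, *figures) for fs, figures in by_fs.items()}
        for name, (size, by_fs) in partitions.items()
    }


if __name__ == "__main__":
    for name, row in unaccounted_table().items():
        print(name, row)
```

For hdc6 this gives 243MiB under ext3 against 15MiB under JFS, and the other two partitions show the same pattern.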

These considerations suggest that:

JFS can be after all more space efficient than ext3 even when the latter has a smaller block size.
If there are many large files, ext3 with a larger block size might take less space than with a smaller block size, because the internal fragmentation at the tail is less important, and many fewer indirect blocks are needed because the file is chunked in many fewer data blocks, or in other words a lower number of bigger fixed size extents.
The available space reported in a JFS file system is not quite the same as that for an ext3 file system, as part of it will be taken by the metadata of newly added files, which is mostly preallocated for ext3.
As the free space area becomes less contiguous, the JFS filesystem will have rising metadata overheads because of an increased number of extent descriptors used, up to the limit of needing an extent descriptor per each 4096 byte basic block.
While reloading a file system into a freshly made filesystem does wonders for ext3 speed, it is likely that it also increases the available space under JFS, as files that previously needed several extent descriptors end up in a single extent.
In order to increase the chances of proximity in allocating inodes and data blocks, ext3 should not be fully allocated, its available space should never fall too low; indeed I think that the default 5% reserve is way too low, considering the sevenfold slowdown over time possible. The same probably applies to JFS, and doubly so, as a larger available free space reserve raises the chances that longer contiguous extents are found, and therefore that both speed and space occupied are better.
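The point about indirect blocks and block size can be made concrete with a little arithmetic. The sketch below assumes the classic ext2/ext3 layout of 12 direct block pointers in the inode followed by single, double and triple indirect blocks holding 4 byte block numbers; those parameters are assumptions stated here, not figures from the measurements above.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def indirect_blocks(file_size, block_size, direct=12, pointer_size=4):
    """Count the indirect blocks needed to map a file of file_size bytes."""
    per_block = block_size // pointer_size
    remaining = max(0, ceil_div(file_size, block_size) - direct)
    total = 0
    if remaining > 0:  # single indirect: one block of pointers
        total += 1
        remaining -= min(remaining, per_block)
    if remaining > 0:  # double indirect: one top block plus its children
        mapped = min(remaining, per_block ** 2)
        total += 1 + ceil_div(mapped, per_block)
        remaining -= mapped
    if remaining > 0:  # triple indirect: top block, middle level, leaves
        mapped = min(remaining, per_block ** 3)
        leaves = ceil_div(mapped, per_block)
        total += 1 + ceil_div(leaves, per_block) + leaves
        remaining -= mapped
    if remaining > 0:
        raise ValueError("file too large for this block size")
    return total
```

Under these assumptions a 1MiB file needs 5 indirect blocks with 1KiB blocks but only 1 with 4KiB blocks, while a file that fits in 12 blocks needs none at all.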

050913

More investigation of Linux filesystem performance and features. It has occurred to me that in the recent testing I did one of the main limitations was that speed tests were done on a freshly loaded filesystem, presumably one where layout was optimal, and that I had not tested the time taken to fsck (there are others, but minor -- I hope).
So I decided to take my main root filesystem, which is around 7GiB in size, and has been rather thoroughly mixed up by upgrades, spool work and so on, copy it to a quiescent disc first block-by-block, then file-by-file (thus in effect optimally relaying it out), both as ext3 with 1KiB block size (which is also its current setup) and as JFS. Then to apply the read/find/remove tests and a new fsck test.
Well, all this takes a long time, which is the main problem with extensive filesystem tests; because small scale tests are just not realistic (and a few GiB is at the lower end of plausibility too, unfortunately). I wish someone with more time and money did some more extensive tests (Note (060424): I have done some new tests with 60GB of data). But then too bad that most of those I have seen have been somewhat less than well constructed.
Another factor is the large number of kernel problems I have encountered, necessitating frequent reboots; the general principle that non default configurations are dangerous seems to hold; for example as I was writing a restore from a tar file on a vfat partition to a JFS partition just got hung, and I am about to reboot. Perhaps the combination of FAT32 and JFS transfers has not been used much...
However, as to the issue of well used filesystems, shocking news (and this is really a like-for-like comparison):

New vs. used file system test
File system	Repack	Find	`fsck`	Notes
used `ext3` 1KiB	64m10s 81s	06m43s 06s	06m44s 04s	13% non contiguous
new `ext3` 1KiB	09m12s 74s	03m03s 03s	04m31s 04s	1% non contiguous
new JFS 4KiB	11m56s 64s	02m50s 05s	02m14s 04s	558MiB free instead of 829MiB
new ReiserFS 4KiB `notail`	26m53s 70s	05m34s 06s	02m34s 16s	1293MiB free instead of 829MiB

Note (060416): more recent similar tests over time are also available.
I made really sure these are like-for-like comparisons; the file system is my root one (around 420k between files and directories, and 6.7GiB of data), and I have copied it for each test to an otherwise quiescent disc, first with dd to get it as-is, highly used, and then I reformatted the partition and used tar to copy it again file-by-file to get a neatly laid out version. For the sake of double checking I then rebooted into the newly created partition and reran the same tests on the original file system, and the results were coherent with those above (the exception is that they were around 25% lower, as the original disc is 7200RPM vs. 5400RPM and so on).
For pure metadata based operations (find, fsck) the newly loaded version is roughly twice as fast; but for reading all the data it is seven times faster. To me this indicates that metadata (directories, inodes) is fairly randomized even in a freshly loaded version (and indeed running vmstat 1 shows very low transfer rates, and the disc rattles with seeking), but data is laid out fairly contiguously. But after repeated package upgrades and the like the data also becomes rather randomized, and indeed this is also borne out impressionistically by looking at the output of vmstat 1 and the rattling of the disc (a lot less).
In the table above there is also a line with JFS numbers; these are for the same stuff as a freshly loaded JFS file system. Since I don't have a well used JFS file system I have decided to convert my root one to JFS, and then in a month or two check out how much it degrades after the usual frequent package install and upgrade that I do. With JFS the speed as freshly loaded is a bit slower or a bit faster than for ext3 freshly loaded, but there is an extra 5% of space used as JFS uses 4KiB blocks instead of 1KiB (and there are lots of small files in a root file system).
In a further illustration that non default configurations are dangerous at one point I compared two otherwise identical JFS file systems one of which however was on all operations 2.5 times slower than the other. I then remembered that for the one that was being slower I had whimsically set the journal size to 30MiB, while for the other 2.5 times faster one I had let the journal size default to 32MiB. It has astonished me that such a small detail has such impressive impact on performance, but then I guess that the JFS code has never been tested with a 30MiB journal size...
Update: I have not been able to reproduce this result, and I have come to the conclusion it was due to a bizarre hardware issue.
There is also a line with ReiserFS numbers, as an outlier point of comparison (saves a fair bit of space, but it is much slower than either ext3 or JFS). No Reiser4 data out of arbitrary laziness (it still needs to be manually patched into the kernel).
So for now the conclusion is: at least for ext3 with time the layout becomes rather fragmented, with extremely large impact on performance in at least some cases. The cost of seeking is so large that a rise in the non-contiguous percentage reported by fsck.ext3 from 1% to 13% involves a sevenfold decrease in sequential reading performance.
To avoid this file systems should be regularly straightened out by dumping them to something and then copying them back file by file.
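The sevenfold figure and the rough halving for metadata operations follow directly from the elapsed times in the table for used and new ext3. A small sketch of the arithmetic, using only those elapsed times (the first figure in each cell):

```python
import re


def seconds(text):
    """Convert a time written like 64m10s into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


# Elapsed times from the table above, ext3 with 1KiB blocks.
used = {"repack": "64m10s", "find": "06m43s", "fsck": "06m44s"}
new = {"repack": "09m12s", "find": "03m03s", "fsck": "04m31s"}


def slowdown(test):
    """How many times slower the used file system is than the new one."""
    return seconds(used[test]) / seconds(new[test])


if __name__ == "__main__":
    for test in used:
        print(test, round(slowdown(test), 2))
```

This gives about 7.0 for the repack, 2.2 for find and 1.5 for fsck.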

050912

Another anecdote about ATA/IDE drives not flushing when asked:

However, the disk hasn't actually written out its cache yet. It lied to the OS / file system and said it had, but it hasn't, it's busy doing something else. Poof, the power goes out.
Now, the journal doesn't have our data, we've already cleared it out, and the file system, which is supposed to have been coherent because we fsynced it, is not, and it is now corrupted.
I have reproduced this behavior a dozen or more times on IDE based systems. The only way to stop it is to tell the drive to stop using it's write cache.

A while ago I had mentioned similar gossip and then added flush the buffer cache more frequently (the kernel one) as a possibly useful palliative.
Unfortunately from my investigation of filesystem features it turns out that only ext3 allows tuning the flushing frequency (which is also useful for laptops, where one wants to make it less frequent); JFS does not, and XFS has a policy of doing it as rarely as possible, which they call delayed allocation, because it raises the chances of being able to allocate a large contiguous extent, and to write it in a single block IO.
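For reference, this is what asking for the data to be flushed looks like from an application. It is a minimal sketch with a hypothetical function name; as the anecdote quoted above makes plain, a drive with its write cache enabled may still acknowledge the flush before the data is actually on the platters, so this is necessary rather than sufficient.

```python
import os


def write_durably(path, data):
    """Write bytes to path and ask for them to be pushed to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move the data from the program to the kernel
        os.fsync(f.fileno())  # ask the kernel to write it to the device
```

The f.flush() call matters: without it os.fsync could run while the data is still sitting in the program's own buffer.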

050911 (updated 060602)

I have collected and listed some mostly online references on file systems.

050910

During my filesystem experiments I have chanced on this nice SlashDot comment:

I've been using ReiserFS _EXCLUSIVELY_ since about 2.4.11 and I've never had a single problem. It's important to format with the defaults and not specify 'special' arguments to mkreiserfs or you can run into trouble.

which is a classic case of the social way of defining that a program works: it works if most users do not run into bugs. Usually such programs are misdesigned and misimplemented, so that they mostly do not work, and sometimes (usually only for a demo to the boss) they seem to work. Then the bugs most complained about get fixed, and that's it.
The alternative is to design and implement the program so that it works almost always save for inevitable rare mistakes, which eventually get found and corrected.
As to the social definition of working, in 2.6.13 the XFS code crashes for blocksizes of 1024, the JFS code crashes if a JFS filesystem is mounted with -o noatime, and UDF if one deletes files.
The obvious inference is that very few users have used blocksizes other than the default for XFS, have used non default mount options for JFS, or have deleted files from a UDF filesystem, all rather plausible assumptions.
BTW, I have done some light testing (not with JFS of course) about mounting journaled filesystem with -o noatime and indeed some operations like long searches are faster, even if not dramatically. This is as expected, because each directory traversal and file read generates by default an access time update, that has to be journaled, and the journaling involves locking etc., and -o noatime avoids all of that. Probably the benefit is much larger on parallel systems.
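When comparing timings it helps to confirm that a filesystem really is mounted with noatime, and the options in force are listed in /proc/mounts. A small sketch follows; the function names are hypothetical, and if a mount point appears more than once only its first entry is looked at.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts style text.

    Each line has the fields: device, mount point, type, options, and two
    numeric fields. Returns None if the mount point is not listed.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None


def has_noatime(mounts_text, mount_point):
    """True if mount_point is listed and mounted with noatime."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noatime" in options
```

On a running system the text would come from reading /proc/mounts, for example has_noatime(open("/proc/mounts").read(), "/").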
I also think some points about reliability of various filesystems need to be expanded. XFS for example very aggressively caches updates in memory, in order to be able to coalesce them in large write-to-disc transactions. Considering that most discs handle reading a lot better than writing, this can be very worthwhile. Unfortunately it also means that crashes can do very large damage, with the loss of a lot of data updates.
Probably ReiserFS does the same. However as to ReiserFS there is another problem: its metadata is very tightly packed and not duplicated. This means that bad blocks in the metadata area can cause very extensive damage, even if it is one of the few (with ext3) that has full bad block handling. By contrast ext3 duplicates the superblock many times, and divides the disk into several semi independent cylinder groups.

050909 (updated 051125)

Some more notes to add to my filesystem tests, for now this comparison of features.

050908

For the sake of getting a very approximate idea of desktop filesystem performance under Linux I have done some mini benchmarks involving:

A PC with an Athlon XP 2000+ and 512MB.
Linux kernel 2.6.13 with no gui, otherwise quiescent.
A 4GB partition on a 160GB 7200rpm 100MHz ATA hard drive.
A .tar.gz of a SUSE 9.3 root filesystem (3,132,712,960 bytes uncompressed, 173,759 entries), chosen because it contains a lot of small files and a number of fairly large files.
The operations of restoring the filesystem, re-tar-ing it, finding a file based on a non-name property, and deleting all files in the filesystem.

The goal has been to see both the elapsed time and the CPU system time for each operation, and how much space is left free when the file system is empty, has been restored, and has been deleted (to see the space efficiency of the filesystem). I have taken reasonable precautions to have the operations not skewed (too much) by various sources of bias.
Note (060426): similar tests for a much larger filesystem on an upgraded PC are available.
The results for various types of filesystem (and various block sizes for some filesystems) are:

Desktop filesystem test
Filesystem	Code size	Free after `mkfs`	Restore	Free after restore	Repack	Find	Delete	Free after delete
`ext3` 1KiB	195,163B	3770KiB	5m22s 0m39s	783KiB	3m03s 0m27s	0m47s 0m01s	2m12s 0m06s	3770KiB
`ext3` 4KiB	195,163B	3747KiB	5m06s 0m30s	454KiB	2m39s 0m25s	0m38s 0m01s	1m19s 0m04s	3747KiB
JFS 4KiB	189,084B	4000KiB	5m38s 0m31s	683KiB	3m46s 0m21s	1m01s 0m03s	2m44s 0m05s	3988KiB
XFS 4KiB	549,809B	4007KiB	5m05s 0m56s	720KiB	3m50s 0m35s	0m44s 0m26s	1m41s 0m27s	3923KiB
UDF 2KiB	72,157B	4016KiB	10m45s 1m07s	768KiB	2m55s 1m02s	1m40s 0m34s	n.a.	n.a.

Notes:

The system times above are probably not complete, as they relate only to the overheads directly incurred.
The repack, find, delete times above are for cleanly loaded data; it would be interesting to see how performance would degrade after a lot of file additions and deletions, which would presumably result in a less optimal layout.
The elevator used was the anticipatory one (except for UDF, for which cfq was used); some tests with the cfq elevator show slightly increased elapsed times, not surprising as the anticipatory elevator optimizes throughput at the expense of latency, and vice versa for cfq.
The numbers for ext3 were with data=writeback, but tests showed that with data=ordered the restore took only a little more time, so the latter, which is the default, is good. With data=journal the restore took 40% more time.
The specific commands given were:
```
# gunzip -dc /tmp/SUSE.tar.gz | (time tar -x -p -f -)
# (time tar -c -f - .) | cat > /dev/null
# (time find * -type d -links +500)
# (time rm -rf *)
```
and the execution of each was preceded by unmounting, flushing the buffer cache, and remounting; for the restore, the archive to be restored was on a different drive.
JFS (because of its OS/2 lineage) has a unique feature, case independent file name matching (which can only be enabled when the filesystem is created).
This of course is not POSIXly correct, but it can be a (rather large) advantage for filesystems exported via Samba.
The UDF filesystem crashed on deletion; also mkudffs supports block sizes of 1KiB, 2KiB and 4KiB, but the udf system module only supports 2KiB.
The ReiserFS and Reiser4 were not tested at all.
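
The elapsed and system times in the table come from time; a minimal Python sketch of the same kind of measurement follows, assuming a POSIX system. The names timed_run and sync_after are illustrative assumptions, not part of the original tests; sync_after charges the final write-back of dirty blocks to the elapsed time, which plain time does not do.

```python
import os
import subprocess
import sys
import time


def timed_run(argv, sync_after=False):
    """Run argv and return (elapsed_s, user_s, system_s, returncode).

    user_s and system_s are the CPU times charged to child processes,
    comparable to the 'user' and 'sys' figures printed by time.
    """
    before = os.times()
    start = time.monotonic()
    proc = subprocess.run(argv, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    if sync_after:
        # Assumption: also count the time needed to flush dirty blocks.
        os.sync()
    elapsed = time.monotonic() - start
    after = os.times()
    return (elapsed,
            after.children_user - before.children_user,
            after.children_system - before.children_system,
            proc.returncode)


if __name__ == "__main__":
    # Hypothetical usage: time a trivial child process.
    print(timed_run([sys.executable, "-c", "pass"]))
```

Even with sync_after the figures remain approximate, since system time spent by kernel threads on behalf of the command is not charged to the child process.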

Some opinions:

ext3 for a small desktop machine seems the best bet overall, and the choice is between a somewhat faster 4KiB block and a rather more space efficient 1KiB version. Considering the availability of a lot of tools (including MS Windows drivers) for ext[23] that are not available for other filesystem types, this impression is reinforced (050917 update).
The one possibly large weakness of this filesystem is that directories are linearly scanned by default. There is now an option to have them as trees, but if directory size is an issue then I'd suggest JFS.
UDF has been included as an oddity, for one thing it is not journaled; but then it is log based, which gives largely the same advantages.
UDF amazingly is the smallest in code and still puts in a fairly credible performance. It is slow at writing, but probably this could be optimized; the UDF filesystem code feels less mature than the others.
It is utterly amazing to me that the code for the XFS filesystem amounts to half a megabyte.
JFS and XFS are a bit slower (but still quite reasonable), but other tests show they perform much better under highly parallel setups. Of the two JFS is slightly slower in elapsed time, and XFS has much higher CPU overheads. Those other tests show that XFS scales a bit better. I would still prefer JFS in most cases because of smaller code and because at high throughput rates higher CPU overhead could matter a lot more.

Overall my preference goes to ext3 with 1KiB blocks (still fast, saves a fair bit of space), or to JFS for more demanding environments or for filesystems exported with Samba (parallelizes well, good features, low CPU usage).
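
As a rough check of the space efficiency remarks, here is a small Python sketch that derives the space used by the restored archive from the free space columns of the table. The figures are copied from the table in its own units; the helper names are illustrative only.

```python
# (free after mkfs, free after restore), copied from the table above.
FREE = {
    "ext3 1KiB": (3770, 783),
    "ext3 4KiB": (3747, 454),
    "JFS 4KiB": (4000, 683),
    "XFS 4KiB": (4007, 720),
    "UDF 2KiB": (4016, 768),
}


def space_used(free_after_mkfs, free_after_restore):
    """Space consumed by the restored files, in the table's units."""
    return free_after_mkfs - free_after_restore


def to_seconds(text):
    """Convert a time such as '5m22s' to a number of seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


USED = {name: space_used(*pair) for name, pair in FREE.items()}

if __name__ == "__main__":
    for name, used in sorted(USED.items(), key=lambda kv: kv[1]):
        print(f"{name:10s} used {used}")
    saving = USED["ext3 4KiB"] - USED["ext3 1KiB"]
    percent = 100 * saving / USED["ext3 4KiB"]
    print(f"ext3 1KiB saves {saving} ({percent:.1f}%) over ext3 4KiB")
    print("ext3 1KiB restore elapsed:", to_seconds("5m22s"), "seconds")
```

By this measure ext3 with 1KiB blocks uses the least space of the five, which is the basis for the preference stated above.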

050907

I have been trying out the various elevator algorithms on my little desktop machine, and reading about them, as well as trying various filesystems, and my approximate conclusions so far are:

As to filesystems, ext3 for most configurations (especially desktops, in particular if dual booting with MS Windows), but JFS for large partitions and/or for systems with several processors and RAID with many discs, and perhaps switch to XFS for really large numbers of processors and disc arrays (051003 update).
As to elevators, cfq for desktops as it minimizes latency, as (the anticipatory elevator) when throughput is more important (but it is not good on subsystems with many heads, like RAIDs), and the deadline elevator for DBMSes (best for random access patterns); noop is good for storage subsystems with their own intrinsic scheduling.

The use of cfq in particular has been useful for me to reduce the hogging of the disc by particularly large disc operations, like installing packages or filesystem scans, that would make most other processes rather unresponsive with the as elevator, for example.
However each of these elevators has blind spots. For example I use a multiprocess piping program to clone some of my partitions to a backup hard disc every night, and while with as, which favours large sequential transfers, it does 20-25MiB/s, with cfq it does only 4-6MiB/s if it runs as 3 processes, returning to 11-12MiB/s if run as 2 processes (050916 update). Pretty amazing.
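
For trying out elevators per drive, 2.6 kernels expose the choice in /sys/block/DEVICE/queue/scheduler, where the active elevator is shown in square brackets. A minimal Python sketch for reading it follows; the function names are illustrative, and the device name is whatever the drive is called on the system at hand.

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse a line such as 'noop anticipatory deadline [cfq]'.

    Returns (active, available), where active is the bracketed name.
    """
    available = []
    active = None
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]
            active = word
        available.append(word)
    return active, available


def current_scheduler(device):
    """Return the active elevator for a block device name such as 'hda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_schedulers(path.read_text())[0]


if __name__ == "__main__":
    print(parse_schedulers("noop anticipatory deadline [cfq]"))
```

Switching is done by writing one of the listed names to the same file as root, so the same path can be used to compare elevators without rebooting.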

050906

I was assisting somebody asking what kind of filesystem to use for a small network storage server with a small RAID array, and then I got asked about various Linux filesystem tradeoffs. My take is more or less this:

The ext2 usually has awesome performance for almost anything, but does not journal, so bad news for large filesystems.
The ext3 has pretty good performance across the board except that since it uses kernel based coarse locking (050916 update), locking being particularly essential for the journal, it does not scale well to highly parallel hardware and process configurations (presumably because locking the journal becomes a bottleneck, as ext2 scales well).
It is probably by far the best for desktop style computers (050917 update); for larger systems it can result in very long fsck times in part due to having index based directories disabled by default (which is right, because ext3 is designed to be simple, and index based directories are sort of unnatural for it, even if they are now available).
JFS and XFS both seem rather more scalable, because they are designed for high degrees of both internal (their own fine grained locking mechanism) and external (ability to work with several drives) parallelism.
XFS seems even more scalable than JFS, but at lower levels of scale the advantage is eaten by the high overheads of XFS. XFS is also awesomely complex code, and even if it is mature, well tested code, that worries me.
Elevator choice can have a truly dramatic impact on performance even greater than filesystem choice.

050905

As to ridiculous memory usage by Konsole here is my latest:

 USER     PRI  NI  VIRT   RES   SHR S CPU% MEM% Command
 pcg       15   0 64192 38788  8120 S  0.0  2.6 kicker [kdeinit]
 pcg       15   0 62412 38960  7076 S  0.0  2.6 konsole [kdeinit]
 pcg       15   0 60752 37108  7160 S  0.0  2.5 konsole [kdeinit]
 pcg       15   0 46908 27868 10480 S  0.0  1.9 konsole [kdeinit]

and this is what it was just after startup a few days before:

 USER     PRI  NI  VIRT   RES   SHR S CPU% MEM% Command
 pcg       15   0 34064 17656 13520 S  0.0  4.3 kicker [kdeinit]
 pcg       15   0 33200 17072 12508 S  0.0  4.1 konsole [kdeinit]
 pcg       15   0 32460 16124 12292 S  0.0  3.9 konsole [kdeinit]
 pcg       15   0 32452 16116 12292 S  0.0  3.9 konsole [kdeinit]

This is just ridiculous and sick: well, each konsole process has a few tabs open, but to have grown to a resident set of several dozen megabytes is just utterly sick (never mind the over 60MB of reserved total memory), even if almost a dozen of those are shared KDE libraries. And KDE is not as bad as others...
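
To put a number on the growth, here is a small Python sketch that sums resident minus shared memory over the two listings above. Treating RES minus SHR as the private resident memory of a process is an approximation, and the figures are assumed to be in KiB as is usual for this kind of listing.

```python
LATEST = """\
pcg 15 0 64192 38788 8120 S 0.0 2.6 kicker [kdeinit]
pcg 15 0 62412 38960 7076 S 0.0 2.6 konsole [kdeinit]
pcg 15 0 60752 37108 7160 S 0.0 2.5 konsole [kdeinit]
pcg 15 0 46908 27868 10480 S 0.0 1.9 konsole [kdeinit]
"""

AT_STARTUP = """\
pcg 15 0 34064 17656 13520 S 0.0 4.3 kicker [kdeinit]
pcg 15 0 33200 17072 12508 S 0.0 4.1 konsole [kdeinit]
pcg 15 0 32460 16124 12292 S 0.0 3.9 konsole [kdeinit]
pcg 15 0 32452 16116 12292 S 0.0 3.9 konsole [kdeinit]
"""


def private_resident(listing):
    """Sum RES - SHR (columns 5 and 6) over all lines of a listing."""
    total = 0
    for line in listing.strip().splitlines():
        fields = line.split()
        resident, shared = int(fields[4]), int(fields[5])
        total += resident - shared
    return total


if __name__ == "__main__":
    print("at startup:", private_resident(AT_STARTUP))
    print("latest:    ", private_resident(LATEST))
```

By this estimate the unshared resident memory of the four processes went from about 16MiB to about 107MiB.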

050904

Rather interesting observation from an interesting comparative test of various DVD rewritable media:

What we did witness and it seems to be the case with most of the media we tested, is that they all need a couple writes/erasures in order to "settle in" after which we had lower levels and fewer errors.

This is not a big problem, because rewritable discs should usefully be fully formatted and written over before use, in part to ensure they are good, in part to initialize them, even if DVD+RW can format incrementally; and for DVD-RW formatting is pretty much essential, as by default they come unformatted and in incremental sequential mode, while for maximum convenience they should be in restricted overwrite mode, and, as the name says, they must first be written to become fully randomly rewritable.
So what I do is to format fresh discs, for example with

dvd+rw-format -force=full /dev/hdg

and then fully write them over with noise data, that is data that has been encrypted with a random password, using a script like this:

#!/bin/bash

: "'stdin' encrypted with a random password"

case "$1" in '-h')
  echo 1>&2 "usage: $0 [ARGS...]"
  exit 1;;
esac

if test -x '/usr/bin/aespipe' -o -x '/usr/local/bin/aespipe'
then
  NOISE_KEY="/tmp/noise$$.key"
  trap "rm -f '$NOISE_KEY'" 0
  dd 2>/dev/null if=/dev/random bs=1 count=40 of="$NOISE_KEY"
  exec 3< "$NOISE_KEY"
  aespipe -e aes256 -p 3
else
  export MCRYPT_KEY MCRYPT_KEY_MODE
  MCRYPT_KEY="`dd 2>/dev/null if=/dev/random bs=1 count=32 \
	       | od -An -tx1 | tr -d '\012\040'`"
  MCRYPT_KEY_MODE='hex'
  exec mcrypt -b -a twofish ${1+"$@"}
fi
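
For comparison, a minimal Python sketch of the prefilling step itself. Note that it uses a different technique from the script above: it takes bytes directly from the operating system random source via os.urandom instead of encrypting standard input with a random key. The function name and chunk size are illustrative assumptions.

```python
import io
import os


def write_noise(out, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of random data to the binary file object out.

    Data is produced in chunks so that memory use stays bounded.
    Returns the number of bytes written.
    """
    written = 0
    while written < total_bytes:
        count = min(chunk_size, total_bytes - written)
        out.write(os.urandom(count))
        written += count
    return written


if __name__ == "__main__":
    # Hypothetical usage: fill an in-memory buffer rather than a real disc.
    buffer = io.BytesIO()
    print(write_noise(buffer, 5000, chunk_size=1024))
```

To fill a real medium the output would be the device opened for binary writing, with total_bytes set to its capacity.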

On the general subject of rewritable quality and reliability, rewritables are based on phase change recording layers, and these apparently decay with time, so they are not suitable for long term archiving.
I have also been writing some notes on how to use packet writing under Linux as recent DVD+RW and DVD-RW are particularly suitable for it.
As to this, very funny news about 16x DVD-RAM drives and media:

Another issue is that the new 16x DVD-RAM media do not support a high overwriting cycle, which means that the discs will perform the best before 10,000 overwrites (100,000 for the 1x, 2x, 3x media).

What is so funny is that DVD+RW (and also DVD-RW to some extent) already supports full random access operation which was one of the two main features distinguishing DVD-RAM, and its main remaining difference with DVD-RAM was that it supported only 1,000 rewrite cycles.
In other words, it may be that the new DVD-RAM standard is really just a rebadging of DVD+RW, with a somewhat higher rewrite cycle.

050903

I have been compressing and encrypting my backups with gzip -2 and mcrypt and it turned out that the latter was many times slower than aespipe. After reading a comprehensive article about time and compression tradeoffs for several compressors I decided to have a look and experimented with lzop, which has a good reputation for being particularly suitable to backups, being fast and still offering good compression.
Indeed, as a test compression on my Athlon XP 2000+ CPU of a tar stream of my 3.47GiB home dir (mostly text, but lots of photos too) shows:

`lzop` vs. `gzip -2`
	gzip -2	lzop
CPU user	372s	126s
CPU system	15s	22s
output	2.57GiB	2.69GiB

It looks like lzop is a clear winner here, even if there is a small, expected, increase in the size of the compressed output. The reduction in CPU time cost allows for higher output speed given the same CPU, as CPU time spent compressing plus encrypting almost makes backups CPU bound here.
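
The tradeoff can be worked out from the table with a few lines of Python; the figures are those reported above, and the dictionary keys and function name are illustrative.

```python
INPUT_GIB = 3.47  # size of the tar stream of the home directory

RUNS = {
    "gzip -2": {"user_s": 372, "system_s": 15, "output_gib": 2.57},
    "lzop": {"user_s": 126, "system_s": 22, "output_gib": 2.69},
}


def summarize(run):
    """Return total CPU seconds, output/input ratio and MiB per CPU second."""
    cpu_s = run["user_s"] + run["system_s"]
    return {
        "cpu_s": cpu_s,
        "ratio": run["output_gib"] / INPUT_GIB,
        "mib_per_cpu_s": INPUT_GIB * 1024 / cpu_s,
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        s = summarize(run)
        print(f"{name:8s} cpu={s['cpu_s']}s ratio={s['ratio']:.3f} "
              f"rate={s['mib_per_cpu_s']:.1f}MiB per CPU second")
```

So lzop uses 148s of CPU against 387s for gzip -2, roughly 2.6 times less, for output that is about 3.5 percentage points larger relative to the input.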

050902

I am always impressed by the power of the Aptitude dependency manager (despite its numerous warts and limitations), in particular as to fine control and its filtering. As I still (why?) use Debian, tracking packages in its various editions, I am trying to avoid for now getting on with the several ABI changes being introduced. Aptitude allows me to exclude from display all packages that depend on either libc6-2.3.5 or on packages that depend on it, with this filter expression:

~D(libc6~V2\.3\.5|!~D(libc6~V2\.3\.5))

which is slow but fairly impressive.

050901

Extremely peculiar statement by a developer who knows his stuff about the relationship among the USB chipset drivers in the Linux 2.6 series kernels:

Use uhci_hcd or ehci_hcd, but never both at the same time. ehci_hcd will work with all lo-speed ports, so uhci_hcd is then no needed.

Ahhh, that's interesting, and poses a somewhat irritating conundrum, at least for me. I have compiled all three of the USB HCD drivers (UHCI, OHCI, EHCI) in my kernel binary, not even as modules, to get universal support without the need for an initrd. Will the EHCI driver support a pure UHCI chipset? Time will tell.
On a completely different note, a startup by Erik Troan and others will use a new package repository system called Conary.

August 2005

050831

Well, while playing around with lsof I noticed that Konqueror had opened and mapped the Arial Unicode MS TrueType font, which is about 24MiB large and has lots of glyphs.
In itself this is not tragic; since this is now done with mmap(2) if many processes open and map the same font the font will be read into memory only once, just like with a font server. But when a font is read into memory it also involves per process data, and with a font server this only happens once.
And on the subject of memory consumption, very revealing and fascinating discussion in a RedHat issue entry; the revelation is that in order to minimize locking among threads, by default the GNU LIBC allocator creates not only a largish stack region per thread, but also a largish 1MiB initial heap arena per thread, which also needs to be naturally aligned. This results in a fair degree of address space fragmentation, and probably also in some data space wastage.
Among the various fascinating details, this one amazes me:

> Alloced memory amount grew to ~2266 Mb (from 1433 Mb
> before) but allocation speed dropped significantly
> (several times).

Were you building the i686 glibc?  I.e. rpmbuild --target
i686 -ba -v glibc.spec?

which seems to indicate that when compiling for the 686 architecture some instructions give a large speedup; my guess is that it is about cheap locking.

050830

Well, I have been investigating a bit the gamin server on which KDE depends for getting notified of filesystem changes (mostly, but not only, to refresh the lists of files when a directory is open in Konqueror).
I got interested in it as after backing up to an external FireWire/IEEE1394 hard drive I could not remove the sbp2 module as it was in use, but no other module was using it. So I started suspecting various monitoring facilities, and looked at gamin because it has a bit of a reputation for keeping things open. So I discovered that it cannot be disabled or removed, but one can configure it to choose polling or kernel notifications (polling is slower but safer) by path, or to make the same choice, or to disable it entirely, by file system type.
On a totally different note, just discovered a really fascinating set of notes by a Sun engineer who attended the OLS 2004 scalability session both as to the numbers for Linux 2.6 and as to the worries he has about Solaris scalability. Random quote:

stop using cmpxchg on multiproc systems, doesn't scale.. 10x slower on 4p, 100x slower on 32p.

050827

Rather baffling discovery while prefilling media like swap partitions or DVD-RW with noise (pseudorandom stream encrypted with a random key, which seems to be rather noise like): it turns out that AES (Rijndael-128) encryption using mcrypt is many times slower than using aespipe which uses the excellent AES implementation by Dr. Brian Gladman.

050826

After listening to the Kamaelia talk at the FAVE about the BBC R&D developing a massive, massively parallel streaming system, little surprise to read that the BBC plans to broadcast programmes onto IP streams as well as over the EM spectrum. I wonder whether this means that UK internet users (to which the webcasts will be restricted) will be required to pay some kind of license fee (I don't have a television as I don't have much time and/or enjoyment watching it).

050825

Ha, another masterful example of programming: when I delete or move a bookmark in the KDE Konqueror bookmark editor, that takes around 4 seconds of CPU time on an Athlon XP 2000+. Why? Well, that's obvious: because they can!
More in detail, I have a few thousand bookmarks, and it looks like operations on a bookmark have a cost proportional to the number of bookmarks, as if the whole bookmark tree were re-laid-out on every update. And why not, if the goal is to look good on a demo with a dozen bookmarks?

050825

Considering my previous rants about memory misuse in many programs, I was rather amused by this article:

A Poltergeist in My Plasma TV
LG's $5,000 set worked for a month. Then things got weird as the unit developed a disturbing "memory leak"

050823

The lead programmer for the OpenBSD project has decided to deal with long standing memory misuse patterns via better allocation in both the kernel and the C library. He well realizes that this will break backwards compatibility with many buggy programs:

we expect that our malloc will find more bugs in software, and this might hurt our user community in the short term. We know that what this new malloc is doing is perfectly legal, but that realistically some open source software is of such low quality that it is just not ready for these things to happen

I admire his determination, but it is quite brave. Also, while some open source software is of such low quality, some proprietary software is even worse, and MS Windows allegedly contains deliberate support for bugs without which many important proprietary packages no longer seem to work properly.

050822

More grief and upset because of the merry t*ssers inflicting their incredible astuteness on the poor X server.
My issue as before is that I would like to specify X fonts in XLFD format with a screen DPI independent format, for example

-adobe-courier-medium-r-*-*-*-100-0-0-*-*-iso8859-*

that is 10 points (100 decipoints) with the horizontal and vertical DPI defaulting to those returned by the X server.
Note that specifying those DPI as * instead of 0 has completely different semantics, and selects the first (more or less random) DPI actually available. Well, in the file dix/dixfonts.c, procedure GetClientResolutions, there is this marvelous bit of code:

/*
 * XXX - we'll want this as long as bitmap instances are prevalent
 * so that we can match them from scalable fonts
 */
if (res.x_resolution < 88)
    res.x_resolution = 75;
else
    res.x_resolution = 100;

which forces (clumsily) the DPI to be either 75 or 100, regardless of the actual screen DPI, and whether there are fonts for the actual screen DPI available. Now virtually all 15" 1024x768 LCD monitors are 85DPI, and I have created a nice font.alias that does define 85DPI versions of all the bitmap (PCF) fonts with the proper size (roughly). But these get ignored, and I get the 75DPI bitmap fonts instead.
The code above looks like an attempt to violate the X font specs to work around a common issue in the shoddiest possible way; the common issue is that if there is no font at the proper DPI, the X server should return no font available, or scale a bitmap font forcibly to the right DPI, which causes trouble or ugliness. Since the X server computes the DPI from the declared screen size in millimetres and pixels, unusual DPIs can easily result.
Well, yes but the correct procedure is to select the font with the nearest DPI, and if nothing near be available, rescale, and if that is not possible, fail, not to preempt shoddily any availability of fonts (or font names) with the right DPI that might be present.
I had fondly but erroneously believed that since the freetype font module can handle bitmap (PCF) fonts too it might be used to work around this, but this is not possible because the bitmap module is not just loaded forcibly but is also loaded first (not last, as a default), and preempts any subsequent font modules from handling bitmap fonts, and not vice versa; in any case the forcing of the DPI is in the server itself, not in one of the font modules.
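
The difference between what the server does and the procedure suggested above can be sketched in Python. The 304.8mm screen width assumes a 15" 4:3 panel (12" wide), and the tolerance for how near a DPI must be before rescaling is preferred is an arbitrary illustrative value; neither comes from the X sources.

```python
MM_PER_INCH = 25.4


def screen_dpi(pixels, millimetres):
    """DPI along one axis from the declared size in pixels and millimetres."""
    return pixels * MM_PER_INCH / millimetres


def server_clamp(dpi):
    """What the quoted dix/dixfonts.c code does: force 75 or 100."""
    return 75 if dpi < 88 else 100


def nearest_dpi(dpi, available, tolerance=5):
    """Pick the available font DPI nearest to the screen DPI.

    Returns None if nothing is within tolerance, in which case the caller
    would rescale a font, or fail if that is not possible.
    """
    if not available:
        return None
    best = min(available, key=lambda candidate: abs(candidate - dpi))
    return best if abs(best - dpi) <= tolerance else None


if __name__ == "__main__":
    dpi = screen_dpi(1024, 304.8)
    print(round(dpi), server_clamp(dpi), nearest_dpi(dpi, [75, 85, 100]))
```

With 85DPI aliases installed the nearest match picks them, while the clamp ignores them and falls back to the 75DPI fonts.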

050821

I have been looking at my complicated (don't try this at home kids!) Debian APT setup, which is complicated because it references several repositories from different distributions (Ubuntu, Debian, Libranet, ...) and from different editions (Debian Sarge, Etch and Sid, Ubuntu Hoary and Breezy, ...).
The /etc/apt/sources.list was changed to reference Debian editions not by level (unstable, testing, stable plus experimental which is a state more than a level) but by name, as I finally decided, after many months of indecision, that tracking specific editions is a lot safer than tracking a state (which must be why Ubuntu does use only the edition names); especially when, as now, the testing and unstable levels are very incomplete and inconsistent just after the release of a new edition of Debian, and as major ABI transitions are in flux.
Therefore I had another look at my APT configuration and pinning preferences which I revised. I also added some entries to assign priorities to packages that contain in their version string an edition or distribution name (something not that clean, but it happens a lot), and I got a stupid warning, motivated by this inane observation:

You can't use 'Package: *' and 'Pin: version' together, it's nonsensical.

Well, in theory, the versions of two random packages are not related; but since people (including the official Debian distribution maintainers) encode edition and distribution names in the version string, in practice it can make a lot of sense, as for example in:

Package: *
Pin: release o=Debian,a=sarge
Pin-Priority: 990

Package: *
Pin: version *sarge*
Pin-Priority: 990

which is too bad.

050820

Quite interesting presentations at the Bristol Linux media FAVE. I am writing most of these notes as the event progresses, thanks to a very convenient WiFi cell set up by Bristol Wireless who have also set up 20 public access workstations on the same wireless cells using the LTSP package.

CSound

Quite impressive free music synthesis system. It is quite ancient, but it is still being improved. It has two programming languages, one high level, one low level, and astonishingly musicians seem to be not bothered by the latter, which is assembler-level.

Kamaelia

The issue is massive multimedia streaming, with a 10 year horizon.

BBC currently delivers 10-50,000 concurrent streams, 1-6 million streams served per day.
With RealAudio they have to pay per stream licensing costs, and 1,000 times growth anticipated.
Mass distribution: P2P and multicast. P2P starts slow, but multicast can be used to seed it much more quickly.
CPUs are not growing in speed as fast as storage and network speeds.
New codecs (like Dirac) needed for growth.
Concurrency is the future, and it is easy.
BBC is a broadcaster, more precisely a program maker, however we have to develop technology. Making technology available helps with feedback and standards.
Real has been used only because it was most scalable and platform independent at the time it was chosen, even if proprietary.
Kamaelia is componentized, based on an architecture with a scheduler and other infrastructure, and then data flow. It has a plugin framework, and a nice GUI frontend for the pipeline/graphline.
Concurrency can be tamed with read only stuff or single reader/single writer pipeline/graphline.
All written in Python, with growing C/C++ bits.
It is not very mature yet, so for production you may want to look into Fluendo's Flumotion which is based on the Twisted framework and GStreamer.
It has been released as free software to expose it to actual usage and garner feedback.
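
The single reader/single writer pipeline idea from the talk can be illustrated with plain Python generators. This is only a sketch of the general pattern, not Kamaelia's actual component API, and all the names are hypothetical.

```python
def source(items):
    """First stage: emit items one at a time."""
    for item in items:
        yield item


def transform(upstream, function):
    """Middle stage: the only reader of upstream, the only writer downstream."""
    for item in upstream:
        yield function(item)


def sink(upstream):
    """Last stage: drain the pipeline and collect the results."""
    return list(upstream)


if __name__ == "__main__":
    # Each stage shares no state with the others except the stream itself.
    print(sink(transform(source([1, 2, 3]), lambda value: value * 2)))
```

Because each stream has exactly one producer and one consumer, no locking is needed between stages, which is the sense in which this style tames concurrency.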

The Rosegarden sequencer and Studio To Go!

Nice tutorial on how to use these MIDI tools.

Access Space

Access Space is an open lab for creative digital activities in Sheffield. It has been set up for creative artists (a euphemism for unemployed people, the speaker said) and they have both availability of free software based technology and courses and seminars.
Studio To Go! is a distribution packaged with smoothly working GNU/Linux audio applications, a bit like aGNUla's DeMuDi; a couple of users present said that they preferred it even to DeMuDi.

A presentation on open source film making

I started to be a bit busy with user enquiries, so I was not following that much. From what I could glimpse the two themes were that free software is one of the enabling technologies for a lot of grassroots film making. I was reminded of the StarWars Revelations movie, which is a pretty good example.

After the talks there were a few sessions of GNU/Linux aided music, including some (angry-sounding) singing by RachelAPP (formerly Natalie Masse, described here) and some pleasant synthetic music later on. But I was very distracted with handing out installation or live CDs, and helping people with various GNU/Linux and sound problems, which went on for a while, so I left the event much later than I had expected, but it was fairly worthwhile.

050816

Curious moves in the distribution world: just as RedHat have decided to create a Fedora Foundation some time ago (but they haven't actually done so yet), NOVELL have decided to create an OpenSUSE project (and organization?) to mimic Fedora. The perplexity is still on how seriously either corporation is prepared to let go, considering the ancient Mandrake fork of RedHat, which after all barely dented, if at all, RedHat's market position.

050815

From a report about the Ottawa Linux Symposium 2005 some moderately startling observations about the relative popularity of Novell's SUSE and RedHat, where Novell seems to be increasing in popularity and RedHat becoming a lot less visible. This may be due to Novell having a much bigger distribution channel, from the times where they seemed to be the networking suite quasi monopolists.

050814

While looking at linking technologies for a friend, in part because of interface linking hell, interesting discovery of a new static library idea that claims to have most of the advantages of shared libraries.
