Notes about Linux filesystems

File system references (170414)

Older references are not entirely accurate, because kernel 2.6 is considerably better than kernel 2.4 and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. The references below are therefore ordered most recent first.

General

The ext2 page.
The ext3 page and the ext3 mailing list.
The ext4 page.
The JFS project and the JFS mailing list archive.
The XFS project and the XFS mailing list archive.
The ReiserFS page.
The Reiser4 page.
F2FS.
Bcachefs.
OpenZFS 2017-03-16.
Comparison of filesystems 2005-08-26.
Linux Filesystem Overview 2005-08-02.

Descriptions (20210305)

Troy Curtis Jr Choose between Btrfs and LVM-ext4 2020-12-30.
Don Brady ZFS: Using allocation classes 2017-03-20.
Adam H. Leventhal A ZFS developer’s analysis of the good and bad in Apple’s new APFS file system 2016-06-26.
SUSE 3.4.2 Support for the Btrfs File System 2016-04-16.
Dan Luu Files Are Hard 2015-12-12.
John Goerzen Results with btrfs and zfs 2013-12-07.
Jonathan Corbet XFS: the filesystem of the future? 2012-01-20.
Steven Sinofsky Building the next generation file system for Windows: ReFS 2012-01-16.
Dave Chinner XFS: Adventures in Metadata Scalability 2012-01-18.
Edited by Don Domingo and Laura Bailey Advanced tuning for XFS 2011-12.
Hartmut Reuter AFS cell "ipp-garching.mpg.de" 2011-09-22.
Goldwyn Rodrigues A look inside the OCFS2 filesystem 2010-09-01.
Jonathan Corbet Solving the ext3 latency problem 2009-04-14.
Hartmut Reuter and others OpenAFS + Object Storage 2008-05-22.
Avantika Mathur Improving fsck Speeds in ext4 2007-09-18.
FUSEWiki - Filesystems.
Joo-Young Hwang A Reliable and Portable Multimedia File System 2006-07-18.
Val Henson Shortening fsck Time on ext2 2006-07-18.
Olaf Kirch Why NFS Sucks 2006-07-18.
Mark Fasheh OCFS2 2006-07-18.
Theodore Ts'o Proposal and plan for ext2/3 future development work 2006-06-28.
Corbet Ext3 for large filesystems 2006-06-12.
Val Henson A Brief History of UNIX File Systems 2005-05-17.
Jeremy Andrews and others Interview: Hans Reiser 2005-09-13.
Mingming Cao and others State of the Art: Where we are with the Ext3 filesystem [paper] 2005.
Mark Feldman Advanced Linux File Systems 2005-05-10.
Andreas Jaeger Large File Support in Linux 2005-02-15.
Juri Haberland Linux ext3 FAQ 2004-10-14.
Rajesh Fowkar EXT3 File System mini-HOWTO 2004-04-23.
Brett Cooper Linux filesystems, 2003-08.
Sanjay Ghemawat Howard Gobioff Shun-Tak Leung The Google filesystem, 2003-10.
Christoph Hellwig XFS for Linux, 2003-08-05.
Richard Menedetter Journaling filesystems for Linux 2002-12-15.
Theodore Ts'o Planned extensions to the Linux Ext2/Ext3 filesystem 2002-06.
Ricardo Galli Granada Journal File Systems in Linux 2002-02-24.
Steve Best JFS for Linux 2002-02.
Daniel Robbins Introducing XFS 2002-01-01.
Eugenia Loli-Queru Interview With the People Behind JFS, ReiserFS & XFS 2001-08-28.
Michael Johnson RedHat's new journaling filesystem: ext3 2001.
Martin Hinner Linux Filesystems HOWTO 2000-08-22.
Stephen Tweedie EXT3 journaling filesystem, 2000-07-20.
Juan Santos Florido Journal Filesystems 2000-07.
Steve Best and Dave Kleikamp JFS layout 2000-05-01.
Steve Best JFS overview 2000-01-01.
Stephen Tweedie Journaling the Linux ext2fs Filesystem 1998.
Adam Sweeney and others, Scalability in the XFS File System 1996-01.

Benchmarks

Warnings: many of these benchmarks are designed somewhat naively, and some truly essential aspects of the context, like the elevator or the filesystem readahead, are often not mentioned; benchmarks under Linux 2.6 can give very different results from those under Linux 2.4; SCSI and ATA/IDE disc drives have very different performance profiles, including sync reporting.

Yaroslav Halchenko git-annex-centric benchmark of the filesystems 2016-07-06 for Btrfs, ext4, ReiserFS, XFS, ZFS.
Michael Larabel Real World Benchmarks Of The EXT4 File-System 2008-12-03 (and comments).
Dave Chinner and Jeremy Higdon Exploring High Bandwidth Filesystems on Large Systems [slides version] 2006-08-09.
Ard Biesheuvel Effects of Filesystem Fragmentation 2006-07-18.
Dave Chinner and Jeremy Higdon Exploring High Bandwidth Filesystems on Large Systems [document version] 2006-07-18.
Mingming Cao and others State of the Art: Where we are with the Ext3 filesystem [slides] 2005.
Lars Rasmussen Linux File Systems Comparisons/Benchmarks 2005-07-19.
Hans Reiser Benchmarks Of ReiserFS Version 4 2005-06-17.
Rufus 2.6FileSystemBenchmarks 2005-06-07.
Sxooter JFS and ext3 are generally the fastest under a database, while XFS and Reiser seem to be pretty slow. 2005-05-11.
anonymous XFS's real talent is hidden in recovery 2005-05-11.
Fallow Filesystems comparison for present time (r4,r3,jfs,xfs,ext3) 2005-04-12.
Steven Wilton Linux filesystem speed comparison 2005-04-10.
Grzegorz Kowal fsbench 2005-01-25.
kwijibo {at} zianet.com The internet of tomorrow today 2004-11-07.
Justin Piszcz Benchmarking Filesystems 2004-06.
Bruce Guenter Benchmarking Maildir Delivery on Linux Filesystems 2004-05-14.
Mary Meredith and Duc Vianney Linux 2.6 performance in the corporate datacenter 2004-01.
Scott Kveton ReiserFS v. ext2 v. ext3 v. JFS v. XFS, 2003-04-03.
Mike Benoit Linux File System Benchmarks 2003-10-28.
Grant Miner and Szakacsits Szabolcs Filesystem Tests, 2003-08-06.
Randy Dunlap Journaling filesystems for Linux 2002-08-15.
anonymous Ext3 vs Reiserfs 2002-07-11.
Bert Scalzo Tuning an Oracle8i Database running Linux, Part 2: the RAW Facts on Filesystems 2002.
Jim Bray Linux Filesystems Comparison: ext2, ext3, xfs, Reiserfs 2001-12-25.

Online discussions

Warning: some of these discussions are listed here because I think that they are notably wrong. Some pointers are to single articles, some to threads.

Large FOSS filesystems.

Desktop filesystem features
Feature	`ext3`	JFS	XFS
Block sizes	1024-4096	4096	512-4096
Max fs size	8TiB (2⁴³B)	32PiB (2⁵⁵B)	8EiB (2⁶³B) 16TiB (2⁴⁴B) on 32b system
Max file size	1TiB (2⁴⁰B)	4PiB (2⁵²B)	8EiB (2⁶³B) 16TiB (2⁴⁴B) on 32b system
Max files/fs	2³²	2³²	2³²
Max files/dir	2³²	2³¹	2³²
Max subdirs/dir	2¹⁵	2¹⁶	2³²
Number of inodes	fixed	dynamic	dynamic
Indexed dirs	option	auto	auto
Small data in inodes	no	auto (xattrs, dirs)	auto (xattrs, extent maps)
`fsck` speed	slow	fast	fast
`fsck` space	?	32B per inode	2GiB RAM per 1TiB + 200B per inode (half on 32b CPU)
Redundant metadata	yes	yes	no
Bad block handling	yes	`mkfs` only	no
Tunable commit interval	yes	no	metadata
Supports VFS lock	yes	yes	yes
Has own lock/snapshot	no	no	yes
Names	8 bit	UTF-16 or 8 bit	8 bit
`noatime`	yes	yes	yes
`O_DIRECT`	yes	yes	yes
`barrier`	yes	no	yes (and checks)
commit interval	yes	no	no
EA/ACLs	both	both	both
Quotas	both	both	both
DMAPI	no	patch	option
Case insensitive	no	`mkfs` only	`mkfs` only (since 2.6.28)
Supported by GRUB	yes	yes	mostly
Can grow	online	online only	online only
Can shrink	offline	no	no
Journals data	option	no	no
Journals what	blocks	operations	operations
Journal disabling	yes	yes	no
Journal size	fixed	fixed	grow/shrink
Resize journal	offline	maybe	offline
Journal on another partition	yes	yes	yes
Special features or misfeatures	In place convert from `ext2`. MS Windows drivers.	Case insensitive option. Low CPU usage. DCE DFS compatible. OS2 compatible.	Real time (streaming) section. IRIX compatible. Very large write behind. Project (subtree) quotas. Superblock on sector 0.

File system hints

This section is about known hints and issues with various aspects of common filesystems. They can be just inconveniences or limitations or severe performance problems.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7a6e59d719ef0ec9b3d765cba3ba98ee585cbde3

File system hints generic (20210419)

Release dependent:

Linux 5.12: added id-mapping for files.
Linux 5.10: F2FS 5.10 has a file-corrupting bug for inline data that needs a patch.
Linux 4.10: F2FS support for zoned (SMR host managed) HDDs.

File system hints for F2FS (20221102)

Support for variable length inline extended attributes added in 2017.

File system hints for JFS (121226)

Support for TRIM only from kernel version 3.7 and later.
No support for barriers; but the flush interval to the journal is very short.
JFS in kernel version 2.6.8 has a significant memory leak.
JFS can handle bad blocks only when a filetree is created, additional ones cannot be handled.
The journal can be disabled for fast writing, but disabling the journal is not safe and should not be used for anything other than reloading backups.
Since each growing extent is allocated space from an allocation group, and each allocation group will only allocate space to a single extent at a time, if the number of extents (or files) being grown is greater than the number of allocation groups some processes will block.

File system hints for `ext3` (120304)

An ext3 filetree can get very fragmented.
Creating a large filetree can take a long time to initialize block groups.
Timestamps (including modification time) have a granularity of 1 second, which means that multiple updates per second are not recorded. This can impact make processing and fsync.
When fsync is issued all outstanding updates in the journal are written out (in some popular cases), making it a very expensive operation.
Flushing from memory to disk is every 5 seconds by default, which can be too frequent, and can mask the lack of fsync in applications.
Older kernels (such as in RHEL5) support only filetrees with 128B inodes, but newer tools create filetrees with 256B inodes by default.
Older versions of GRUB can only read filetrees with 128B inodes.
Maximum filetree size is 8TiB.
Support for only 32k subdirectories in a directory.
Support for only 32k hard links to a file.
Support for TRIM and FSTRIM only from kernel version 2.6.36.
Checking a damaged filesystem can take months.
Directory indices or ACL blocks can be allocated away from directory data and lead to terrible performance.

File system hints for `ext4` (120922)

In-place conversion from ext3 leaves existing files allocated as they were.
Timestamps (including modification time) have a granularity of 1 second if the older ext3 compatible inode size of 128 bytes is used or kept. If so, multiple updates per second are not recorded. This can impact make processing and fsync.
Flushing to disk is much less frequent than for ext3, which increases the chances of data loss unless barriers are enabled and fsync is used by applications.
It is possible to create ext4 filetrees larger than 8TiB but this requires a recent kernel with a page cache that supports that.
It is possible to create ext4 filetrees larger than 16TiB but this requires not just a recent kernel but also mke2fs version 1.42 or newer.
Online resize of ext4 filetree from less than 16TiB to more than 16TiB is possible from kernel release 3.3 but only if the filetree was created with 64 bit offsets.
Since kernel version 3.5 ext4 metadata is optionally checksummed.
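The 8TiB and 16TiB thresholds mentioned above follow from the width of the block number and the block size. A minimal sketch of the arithmetic, assuming 4096 byte blocks; the function name is hypothetical and the 31 bit case is simply what the 8TiB figure works out to with that block size:

```shell
# max_fs_tib BITS BLOCK_BYTES
# Prints the largest filetree size, in TiB, addressable with block
# numbers of BITS bits and blocks of BLOCK_BYTES bytes.
# Only meant for BITS up to 32 with small block sizes, so the
# intermediate value fits in 64 bit shell arithmetic.
max_fs_tib() {
    bits=$1
    block_bytes=$2
    echo $(( ((1 << $bits) * $block_bytes) >> 40 ))
}

# 32 bit block numbers, 4096 byte blocks: the 16TiB limit that
# applies unless the filetree was created with 64 bit offsets.
max_fs_tib 32 4096

# 31 bit block numbers, 4096 byte blocks: the 8TiB figure.
max_fs_tib 31 4096
```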

File system hints for XFS (140925)

Kernel version dependent hints:

3.17: fixes a long standing bug whereby, by default, when a filetree's space is increased any newly created allocation groups are not used for inode allocations. A workaround is to remount first with option inode32 and then remount again with option inode64.
3.4.5: fixes a delayed segment allocation issue.
3.2.x: bug that causes needless wakeups in xfsaild.
3.2.x: Bug of double unlocking of the ilock.
3.2.12 and older: The default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.
2.6.39 and newer: Support for TRIM and FSTRIM.
2.6.37: There is a known bug in the VFS which impacts XFS.
2.6.35 and older: Mounting a filesystem that was mounted with inode64 without it can cause problems.
2.6.32 and newer: The su and sw parameters are obtained automatically from Linux MD, with xfsprogs 3.1.1 or newer.
2.6.27: Case insensitive filenames.
2.6.27.3: Fixes for some regressions.
2.6.21 and older: On crash a file can have nulls inside.
2.6.17 (point versions up to .6): Bug that leads to directory corruption.
2.6.17 and newer: Barriers are enabled by default.
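The double remount workaround for kernels older than 3.17 can be sketched as below. The function name and the MOUNT override are hypothetical conveniences (the override allows a dry run with echo); the mount point is whatever filetree was grown, and the real commands need root.

```shell
# xfs_reenable_new_ags MOUNTPOINT
# Remounts first with inode32 and then with inode64, as in the
# workaround described above. Set MOUNT=echo to only print the
# arguments that would be passed to mount.
xfs_reenable_new_ags() {
    mnt=$1
    mount_cmd=${MOUNT:-mount}
    "$mount_cmd" -o remount,inode32 "$mnt" &&
    "$mount_cmd" -o remount,inode64 "$mnt"
}
```

For example `MOUNT=echo xfs_reenable_new_ags /data` prints the two sets of mount arguments without changing anything; `/data` is only an example path.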

Kernel version independent hints:

Since the superblock is at sector 0 of the filetree volume, one cannot have a partition boot loader or other metadata in the same volume.
XFS does not handle bad blocks at all.
Applications that don't issue fsync can get files full of zero blocks if there is a crash. This is not an issue with XFS, but with the applications.
In 32b mode it is possible to use a filetree that is larger than fsck can repair.
In 32b mode or without the inode64 option inodes will only be allocated in the first 1TiB (if the sector size is 512B) of space.
In 32b mode with 4KiB kernel stacks there is a strong possibility of stack overflow, and a near certainty if exported by NFS.
In 64b mode kernel stacks are 8KiB by default, but there is still some possibility of stack overflow with NFS exports, especially on top of DM/LVM2.
The inode64 option rotors directories across AGs, and then attempts to allocate space for new files in the AG containing the directory, which is quite different from the alternative because if you create a bunch of files in the same directory, without inode64 XFS will scatter the extents all over the disk rather than trying to allocate them next to each other.
When a file is opened the entire list of extents is loaded and kept in memory, which can consume a lot of memory for highly fragmented files.
Larger inode sizes can store more extended attributes and inode extents maps for greater efficiency.
Up to 64KiB of extended attributes are supported.
Filetree freezing hangs all applications accessing files in that filetree.
It is possible to move an internal journal to an external journal, but not vice versa, unless the filetree had an internal journal to start with.
Since metadata access is serialized by allocation group, if all allocation groups are in use to grow extents writing can stop for all other files, or similarly if the files are in the same allocation group. Having more allocation groups typically improves multithreaded performance.
If the XFS allocation group size is a multiple of the underlying RAID stripe then the allocation groups (and their metadata) may end up on the same disks, preventing parallel IO across the stripe. If mkfs.xfs can discover the underlying RAID geometry it will warn about this with the message:
Warning: AG size is a multiple of stripe width. This can cause performance problems by aligning all AGs on the same disk. To avoid this, run mkfs with an AG size that is one stripe unit smaller, for example %llu.
The solution indicated is to manually specify an allocation group size that is not congruent with the stripe width, usually a bit smaller.
Configuring an external journal will disable XFS' write barrier support.
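The condition behind the allocation group size warning above is simple modular arithmetic and can be sketched as below. The function name is hypothetical, all three arguments are assumed to be in the same unit (for example filesystem blocks), and the suggested size just follows the wording of the mkfs.xfs message, that is one stripe unit smaller.

```shell
# ag_size_check AGSIZE STRIPE_UNIT STRIPE_WIDTH
# Prints "ok" if AGSIZE is not a multiple of STRIPE_WIDTH, otherwise
# prints a suggested AGSIZE that is one stripe unit smaller.
ag_size_check() {
    agsize=$1
    sunit=$2
    swidth=$3
    if [ $(( agsize % swidth )) -eq 0 ]; then
        echo "aligned: try $(( agsize - sunit ))"
    else
        echo "ok"
    fi
}

# Example figures only: a stripe unit of 16 blocks across 4 data
# disks gives a stripe width of 64 blocks.
ag_size_check 1048576 16 64
```

The suggested value would then be given to mkfs.xfs through its `-d agsize=` option, in the unit that option expects.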

File system hints for Btrfs

Inside the Btrfs notes page.

File system hints for ZFS (201213)

Release independent:

The arecord must match the largest record size.
Vdevs cannot be changed or expanded or deleted.
All IO is done by at least a full recordsize, regardless of how much is actually read.
Removing snapshots can be very, very slow.

Release dependent:

ZFS 2.0 (2020) has been renamed as OpenZFS and can have a persistent cache, the Zstd algorithm, and a PAM module for file based encryption.
ZFS 0.6.4 (2015) has redundant_metadata and it is set by default to all. It also has as other notable new features LZ4 compression and bookmarks.
In Ubuntu LTS 16, hot spares don't work.
In older FreeBSD versions spares can become accidentally UNAVAIL; to fix this, detach and reattach them.

File system hints for NFS (210203)

Release independent:

NFS protocol versions up to NFSv3 only report timestamps, including modification times, with a granularity of 1 second, even if the exported filesystem has a finer timestamp granularity.
The Linux NFS client does not do incremental flushing, but it issues COMMIT packets with no range, that flush all pending writes.
NFS over TCP will not restart a half-open connection.
The NFSv4 exports can all be under the same directory, and that directory must be exported with fsid=0, and the exported filetree paths do not contain the name of that directory. But this is optional and exports can be done as separate filetrees too.
Using des-cbc-crc:normal keytab entries with newer versions of Kerberos may require editing /etc/krb5.conf.
NFSv4 ID mapping must be enabled, and the mapping domain must be explicitly set and exactly the same between client and host. It does not need to be the DNS domain, but conventionally it is.
NFSv4 ID mapping with Kerberos authentication only works if the undocumented variable Local-Realm is set; it must be the same between client and host, and must be the name of the Kerberos realm.
When mounting with Kerberos authentication the name given for the server must be identical to that of the service principal for the server, and must be a canonical name of the server, that is the address it resolves to must resolve back to the same name, perhaps because of a bug in GNU LIBC which affects the Kerberos library.
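The NFSv4 arrangement with a single directory exported with fsid=0 can be made concrete with a small sketch. The helper below is hypothetical (it is not part of nfs-utils) and only shows which path a client would name for a given exported filetree; `/export` and `/export/home` are example paths assumed for illustration.

```shell
# nfs4_client_path ROOT EXPORTED
# ROOT is the directory exported with fsid=0 on the server,
# EXPORTED is a filetree exported underneath it. Prints the path
# that a client names when mounting, which omits ROOT.
nfs4_client_path() {
    root=${1%/}
    path=$2
    case "$path" in
        "$root")   printf '%s\n' "/" ;;
        "$root"/*) printf '%s\n' "${path#"$root"}" ;;
        *)         echo "error: $path is not under $root" >&2
                   return 1 ;;
    esac
}

# With /export exported as fsid=0 and /export/home exported below it,
# a client would mount server:/home rather than server:/export/home.
nfs4_client_path /export /export/home
```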

Release dependent:

Fixed from 4.14 and 4.19: in some cases where programs rename an NFS file, under NFS 4.0 but not 4.1, the Linux NFS client does not revalidate a file handle, resulting in some programs reporting a Stale file error.
In kernel 2.6.32 there is a bug in UDP offloading that causes freezes and corruption in NFS.
In some recent versions of the Linux kernel's NFS client autotuning can result in instability:
- a change in RHEL 6.3: sunrpc.tcp_max_slot_table_entries dynamically allocating RPC slots up to the maximum (65536).
- Reverting to previous limit of 128 recovered system stability.
Kerberos security export syntax must be done with a pseudo client of gss/krb5 or gss/krb5[ip] with kernel versions older than 2.6.23 or nfs-utils versions older than 1.11.
NFSv4 with Kerberos can only use des-cbc-crc:normal keytab entries (also in RHEL) in kernels older than 2.6.35; other enctypes became usable from 2.6.35 thanks to a massive update to the Linux GSS support module.
With kernel versions that only support des-cbc-crc enctypes, if unsupported enctypes are used the GSS server dæmon will print a debug message like:
```
prepare_krb5_rfc_cfx_buffer: not implemented
```

Summary of conditions for a working NFSv4 with Kerberos GSSAPI authentication and/or encryption:

The rpc_pipefs filesystem must be mounted on both fileserver and client, usually at /var/lib/nfs/rpc_pipefs/ or /var/run/rpc_pipefs/ depending on distribution, and the Pipefs-Directory parameter in idmapd.conf must be set to that path.
The id mapping parameter Domain in idmapd.conf must be set to the same value on client(s) and server(s), and usually is the lower case version of the relevant DNS domain name.
The Local-Realm parameter in idmapd.conf must be set to the same value on client(s) and server(s) and must be the Kerberos realm name used for both.
ID mapping in the kernel must be enabled with a command similar to:
```
echo -n N > /sys/module/nfs/parameters/nfs4_disable_idmapping
```
The rpcsec_gss_krb5 kernel module must be loaded.
The file gssapi_mech.conf must list the gssapi_krb5 shared object with mechglue_internal_krb5_init as the initialization function.
On all involved systems the /etc/krb5.keytab must have the relevant host and service entries:
- The server must have in its keytab the keys for the nfs/ and host/ service principal for its canonical DNS name, unless the svcgss dæmon is configured otherwise.
- Each client must have in its keytab the keys for the host/ and nfs/ service principal for its own canonical DNS name, unless the gss dæmon is configured to run with the -n option, or the gss dæmon is a new one that can use just the host/ service principal.
The canonical DNS names for the machine should be all lowercase, and because of canonical name ambiguities using multiple IP addresses on a host may not work.
To sort out canonicalization issues the /etc/hosts file or the DNS zone may need to be carefully edited.
These dæmons should be running:
- The usual rpcbind (or portmap) plus idmap on both server and client.
- svcgss on the server, unless this is a recent version of the NFS Linux utilities in which case it is not needed. Reading carefully the man page can be useful.
- gss dæmon on the client. Reading carefully the man page can be useful.
The exports file on the server must list the relevant filetrees as exported with one of the gss/ security types, and the exporting filetree must be mounted on the client using the same security type.
Each user accessing a NFS filetree mounted with Kerberos security must have their own Kerberos principal, the nfs/ principal for the client is used only to mount the filetree, but does not grant any user on the client access to the files.
If the mounting succeeds the Kerberos credential cache /tmp/krb5cc_machine_REALM will have a key from the local host's service principal and a key for the server's service principal.
The type of salt should be normal unless using a kaserver in which case it should be afs3 and v4 only if relying on a Kerberos4 KDC.
The Kerberos host and service keytab entries must have enctype des-cbc-crc for kernel versions older than 2.6.35.
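A few of the conditions in this summary can be checked read-only from a shell. The sketch below is only an illustration: the function names are hypothetical, the default paths are the usual Linux locations, and each function accepts an alternative file so it can be tried on copies. A module built into the kernel will not appear in /proc/modules, so that check can give a false negative.

```shell
# True if an rpc_pipefs filesystem appears in the mount table.
pipefs_mounted() {
    awk '$3 == "rpc_pipefs" { found = 1 } END { exit !found }' \
        "${1:-/proc/mounts}"
}

# True if the rpcsec_gss_krb5 module is listed as loaded.
gss_module_loaded() {
    grep -q '^rpcsec_gss_krb5 ' "${1:-/proc/modules}"
}

# True if kernel ID mapping is enabled, that is if the
# nfs4_disable_idmapping parameter reads N.
idmapping_enabled() {
    param=${1:-/sys/module/nfs/parameters/nfs4_disable_idmapping}
    [ "$(cat "$param")" = "N" ]
}

# idmapd_param NAME [FILE]
# Prints the value of the first "NAME = value" line in idmapd.conf,
# for comparing Domain, Local-Realm and Pipefs-Directory between
# client(s) and server(s).
idmapd_param() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" \
        "${2:-/etc/idmapd.conf}" | head -n 1
}
```

For example `idmapd_param Domain` run on both client and server should print the same value.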

Some useful pages for using NFSv4 with Kerberos:

File system hints for OpenAFS (130913)

Version dependent:

From version 1.6.2 to 1.6.5 inclusive a bug in the new asynchronous IO code means that core dumps (and probably other kernel initiated IO) cannot be made to OpenAFS volumes.
From OpenAFS versions 1.4.15, 1.6.5, 1.7.26 it is possible to use Kerberos keys with better enctypes than des-cbc-crc and des-cbc-md5, and they should be used because 56-bit DES encryption is quite easy to break.
In OpenAFS up to and including 1.6.1 there is a bug with select and having more than 1024 file descriptors open that can cause memory corruption in the fileserver or the salvageserver; this often results in a hung salvageserver process and error messages like:
```
Wed Dec 18 08:52:47 2013 SYNC_getCom:  error receiving command
Wed Dec 18 08:52:47 2013 FSYNC_com:  read failed; dropping connection (cnt=8568)
Wed Dec 18 08:52:47 2013 SYNC_getCom:  error receiving command
Wed Dec 18 08:52:47 2013 FSYNC_com:  read failed; dropping connection (cnt=8569)
mode c1ff
6: dev 7, inode 25811, length 0, type/mode c1ff
7: dev 7, inode 9532, length 0, type/mode c1ff
8: dev 7, inode 21498, length 0, type/mode c1ff
9: dev 905, inode 95580989, length 65, type/mode 8000
10: dev 905, inode 94903997, length 2048, type/mode 8000
11: dev 905, inode 97388566, length 8448, type/mode 8003
```
in the FileLog.
The easiest solution is to put the line
```
ulimit -n 1021
```
in the script that starts BOS. The best solution is to upgrade to the latest release, as 1.6.1 has this and other known issues.
This bug affects the Debian 7/Wheezy packages for OpenAFS at least up to version 1.6.1-3+deb7u1. There are Debian 7/Wheezy packages of version 1.6.5.1 in the wheezy-backports archive.
In OpenAFS version 1.6.0 a bug can lead to extreme overpinging of file servers.
AFS protocol encryption is available in OpenAFS from version 1.4.11 or from version 1.5.60, and probably it has always been in OpenAFS.
OpenAFS versions 1.6 and newer can put the cache on any filesystem as the relevant code has been rewritten. However it is useful to have the cache on a filetree without a journal, as the cache is ephemeral.
Up to and including OpenAFS version 1.4 the client cache and the partitions must be on an ext2 or ext3 filesystem.
OpenAFS with Kerberos can only use des-cbc-crc:normal tickets and since version 1.2.11 it can also use des-cbc-md4 and des-cbc-md5; using others may require editing /etc/krb5.conf.

Version independent:

OpenAFS terminology is sometimes different from common usage. In particular a volume is actually a subtree of AFS directories and files, and a partition that holds volumes is actually a subtree of some native operating system filesystem, whether the partition is on a fileserver or is the cache on a client.
OpenAFS file servers will use as a partition anything that is mounted under directories whose name begins with vicep in the system's root directory.
It is possible to use a file as the block device for a partition holding OpenAFS volumes, as long as it is mounted via a loop device.
The partition for AFS volumes on an OpenAFS fileserver does not need to be in its own dedicated block device, and neither does the AFS cache filetree on an OpenAFS client, but out of space conditions caused by space in the filetree being less than that declared for the OpenAFS may be handled badly. The /vicepAB partitions which are not mount points will be however ignored unless they contain a file called AlwaysAttach.
AFS cell names are case insensitive but they are stored internally in uppercase and printed in lower case. By convention they should always be specified in lower case, as there are default mappings to case sensitive Kerberos realm names in all upper case and to case insensitive DNS domain names in all lower case.
The afsio program cannot use dynroot because it relies on libafscp which does not handle synthetic roots.
OpenAFS uses UDP and implements a window style flow control algorithm similar to TCP, but the maximum window size is much smaller, which limits performance on links with a large BDP. The protocol allows up to 256 outstanding packets, but versions of OpenAFS limit that to 32 packets, with the exception of the YFS version which allows for the full 256 packets.
The default network buffers sizes for OpenAFS fileservers are usually very inadequate:

So, setting a UDP buffer of 8Mbytes from user space is _just_ enough to handle 4096 incoming RX packets on a standard ethernet. However, it doesn't give you enough overhead to handle pings and other management packets. 16Mbytes should be plenty providing that you don't

a) Dramatically increase the number of threads on your fileserver
b) Increase the RX window size
c) Increase the ethernet frame size of your network (what impact this has depends on the internals of your network card implementation)
d) Have a large number of 1.6.0 clients on your network

To summarise, and to stress Dan's original point - if you're running with the fileserver default buffer size (64k, 16 packets), or with the standard Linux maximum buffer size (128k, 32 packets), you almost certainly don't have enough buffer space for a loaded fileserver.
Since a read-only replica will always be preferred to a read-write one, even if all read-only replicas are not available as long as one exists OpenAFS will not use the read-write one.
Therefore it is always a good idea if a volume has read-only replicas to create an additional read-only replica in the same partition as the read-write one, as that is essentially free as it does not require file copying.
It also is a good idea because in case of a release (updating read-only volumes to have the same content as a read-write volume) the read-only replica in the same partition gets updated very quickly, and then other read-only replicas get updated from it, reducing the latency of the release operation.
Because of a (difficult to fix) bug it is strongly recommended to avoid having read-only and read-write replicas in different partitions on the same server because at boot it could happen that the partition with the read-only replica is the first to be discovered by OpenAFS and then the read-write replica is never attached.
Having read-only and read-write replicas of the same volume in different partitions on the same server is a design error, and there are checks against that, but some corner cases can be missed by the checks.
When reconfiguring an OpenAFS service special care must be taken when changing the database server with the lowest IP address.
When reconfiguring the addresses of OpenAFS DB servers the client caches must be restarted, or reset with fs newcell.
When changing the addresses of OpenAFS file servers there are important cautions concerning the use of fs changeaddr which is however rarely needed.
For various quorum related reasons the number of AFS db servers should be odd.
It is possible to authenticate OpenAFS clients against multiple cells or as multiple users, because the client authentication cache can hold distinct AFS tokens, even if the Kerberos credential cache can only hold those for one principal.
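The effect of the small flow control window on links with a large BDP, mentioned in the hints above, can be estimated with the usual window limit: at most one window of data per round trip. A rough sketch, where the function name is hypothetical and the payload of 1400 bytes per packet is an assumption for a standard ethernet MTU, not a figure from OpenAFS:

```shell
# rx_max_bytes_per_sec WINDOW_PACKETS PAYLOAD_BYTES RTT_MS
# Upper bound on throughput for a window based protocol:
# one full window per round trip time.
rx_max_bytes_per_sec() {
    window=$1
    payload=$2
    rtt_ms=$3
    echo $(( window * payload * 1000 / rtt_ms ))
}

# 32 packet window, assumed 1400 byte payload, 10ms round trip.
rx_max_bytes_per_sec 32 1400 10

# The same link with the full 256 packet window.
rx_max_bytes_per_sec 256 1400 10
```

With these assumed figures the bound is about 4.5MB/s for 32 packets and about 36MB/s for 256 packets, whatever the raw bandwidth of the link.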

File system hints for Lustre (120307)

Not all Lustre releases are equally reliable, and choosing a good one for a specific environment can require some experimentation.
MGS, MDS and OSS can reside on the same system, and the same system can run those of several instances.
Mounting an OST on the OSS system that holds it can cause memory resource deadlock and is not recommended.
The MDS can be rather CPU bound.
The best RPC size for transfers is 1MiB, and aligning storage and user requests to 1MiB boundaries can give very large performance increases.
Operations that query or update the inodes, such as changing the size of a file, can be very slow, as all OSSes on which the inode is sliced must be contacted.
It is much better to use Lustre's own find command than the platform one.
Re-exporting a Lustre mount via SMB or NFS can give very poor performance.
Dual-linking Lustre servers on two separate networks for Lustre-client and intra-Lustre communications can avoid a lot of problems.
LNET imposes some restrictions on changing IP addresses while the system is running.
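Aligning requests to the 1MiB boundaries mentioned above is plain integer rounding. A minimal sketch, with hypothetical function names, for rounding an offset or a length in bytes down or up to a multiple of 1MiB:

```shell
MIB=1048576

# Round a byte count down to a multiple of 1MiB.
align_down_mib() {
    echo $(( $1 / $MIB * $MIB ))
}

# Round a byte count up to a multiple of 1MiB.
align_up_mib() {
    echo $(( ($1 + $MIB - 1) / $MIB * $MIB ))
}

# An offset of 1500000 bytes lies between the 1MiB and 2MiB boundaries.
align_down_mib 1500000
align_up_mib 1500000
```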

Some of my notes on filesystems (120307)

These are pointers to some of the entries in my technical blog where filesystems are discussed:

150316 Contortions needed for effective use of some advising operations
120222b Log structured and COW filesystems
120222 Filesystem recovery and soft updates and journaling
120220 Code size as indicator of filesystem complexity
120218b A COW, snapshotting version of 'ext3' and 'ext4'
120128b Presentation on petascale filesystems
120123 Types of clusters and cluster filesystems
120120 OCFS2 with DRBD
120118 OCFS2 a nice filesystem with good performance
120111 Ambiguous filesystem terminology and change
120108 Good transfer rates on bulk file-tree copy
120104 Switching from JFS to XFS on my data file-trees
111230 Filesystems for SSDs
090508 Amazing filesystem news from Red Hat
090131 Impressive JFS and eSATA performance
080822 A new log structured filesystem design
080516 Large storage pools are rarely necessary
080417b Large storage pools and Lustre
080415 Dimensions of filesystem performance
080407 A cheap large reliable storage pool system
080406 Much improved filesystem checking for XFS
080216 A RAID and filesystem perversity
080210 Some more data on filesystem checking speed
070923b So the cases where RAID5 makes sense are...
070923 Yet another RAID5 perversity
070914 Another used filesystem test
070701b Disappointing Linux NFSv3 writing misfeature and workaround
070331c Check of a 5TB filesystem takes 12 hours
070127 More RAID5/RAID6 madness
061031b EMC2 often recommends RAID3
061031 Storage wire and command protocols, and SAN vs. NAS
061022b RAID5 perversions
061022 XFS etc. performance for parallel IO and fragmentation
061015 Options for mailboxing and tagged queueing
061014 Effect of elevator on multistream reading performance
061013 Evolution of a video-on-demand system
060914b Game load times, fragmentation; reporting to base
060729 Volumes and filesystem tags
060724b Partitions, extended partitions and 'ms-sys -p' on NTFS
060723 Now IO has priorities too under Linux
060702 The 'ext4' filesystem and RHEL
060625b Swap space misallocation in Linux
060514 Quick write speed test for NFS and CIFS
060513 Quick read speed test for NFS and CIFS
060510 Quick speed test for Reiser4
060424b Summary of fsck times
060424 Some larger filesystem informal speed tests
060423b Filesystem free space and fragmentation
060422b Disc-to-disc defragmenting and backups
060416 Retesting JFS performance over time
060323 comments on a report on the LKML that a bug in disk queue management in Linux means that writes to disk can be delayed a great deal.
060306b on performance degradation after 2 months of using JFS after a fresh install of Fedora 5.
051226b on performance degradation after 4 weeks of using JFS after a fresh install of Fedora.
051219 on my switch to ext2 for all my MS Windows filesystems except the boot one.
051204 on surprising speed test results with JFS and ext3 with and without extended attributes and ext3's new hash directory indices.
051127 on filesystems and partition size, and how large should partitions be.
051108 with comments on a filesystem with 4 million inodes in a 138GB partition, and fsck.
051101 on JFS speed degradation after 6 weeks of use.
051030b with some comments on the ZFS from Sun.
051014 on the filesystem usage of some popular apps that are slow to startup.
051012d with code example for advising for IO access patterns and their ineffectiveness under Linux.
051012 on several rather interesting threads in the XFS mailing list.
051011b on preallocating when overwriting.
051011 on filesystems, advising and preallocations.
051010 on large blocks sizes or fragmentation in filesystems and the davtools package to visualize ext3 fragmentation.
051009 on a case where fsck takes more than one month, and some filesystems being VLDBs.
051008 on better disk and memory kernel parameters.
051003 on having switched to JFS for Linux, and considering switching to ext2 for MS Windows.
050925 on some details of the way I did some previous quick tests of filesystem speed.
050917 on some suspected troubles with JFS and noatime.
050916 on disk performance, read-ahead and filesystems, and turning ext3 into something else.
050915 on some speculation about performance issues starting programs and filesystems.
050914 on space overheads for metadata and internal fragmentation in various filesystem types.
050913 on how filesystem performance degrades with time.
050912 again on what "works" means for filesystems.
050910 on the various meanings of "works" for filesystems.
050908 on testing speed for various filesystems on a root filesystem.
050907 on comparing elevators.
050906 comparing various filesystems as to what they are good at.
050523 on filesystems and write caching.

JFS structure summary (051031)

This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.

Basic entities

Partition: A partition is a container, and has merely a size and a sector size, also called a partition block size, which defines IO granularity (and is usually the same for all partitions on the physical medium); a partition only contains an aggregate.
Extent: A contiguous sequence of blocks, wholly contained in one allocation group. The maximum size of an extent is 2²⁴-1 blocks, or almost 64GiB. There are a few types of extents; one of them is ABNR, which describes an extent containing zero bytes only.
Map: A map is a collection of extents that contains a B+-tree index rooted in the first extent of the collection; for example it can be an index of extents for a file body, in which case it is an allocation map, or an index of inode names for a directory, in which case it is called a directory map; the extents in a map are described in the map itself. The root extent of the map is called btree and the leaf extents are called xtrees (and contain an array of entries called xads) if they are for an allocation map, and dtrees if they are for a directory map.
File body: A file body is a sequence of one or more extents, the extents being listed in an allocation map. The extents may be from different allocation groups.
Inode: An inode is a 512 byte descriptor for the attributes of a file or directory, and contains also the root of a file body's allocation map, or of a directory map.
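
The "almost 64GiB" figure for the maximum extent size can be checked with a little arithmetic. This is only a sketch, assuming the 4096 byte aggregate block size that the Aggregates section says is currently mandatory; the constant and function names are mine and are not taken from the JFS sources.

```python
# Sketch: check the "almost 64GiB" maximum extent size.
# Assumption: 4096-byte aggregate blocks; all names are illustrative only.
BLOCK_SIZE = 4096                # bytes per aggregate block
MAX_EXTENT_BLOCKS = 2**24 - 1    # maximum extent length, in blocks
GIB = 2**30


def max_extent_bytes(block_size: int = BLOCK_SIZE) -> int:
    """Largest extent size in bytes for a given block size."""
    return MAX_EXTENT_BLOCKS * block_size


if __name__ == "__main__":
    size = max_extent_bytes()
    # One block short of 64GiB, hence "almost".
    print(f"{size} bytes = {size / GIB:.6f} GiB")
```

The result is exactly one block short of 64GiB, which is where the "almost" comes from.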

Aggregates

Aggregate

An aggregate is about allocating space, and has a size and an aggregate block size, which defines the granularity of allocation of space to files, and currently must be 4096.

Aggregates have a primary and a backup superblock.
Aggregates contain one or more allocation groups.
Aggregates have a primary and a backup aggregate inode table, each of which must be exactly 32 inodes long.
Aggregates may contain one or more filesets, but currently only one is allowed.
Aggregates also have some space reserved for use by jfs_fsck.

Allocation group

An allocation group, also known as an AG, is merely a section of an aggregate. There is no data structure associated with an allocation group; all data structures belong either to the aggregate or to a fileset.

There can be up to 128 AGs in an aggregate, and each must be at least 8192 blocks or 32MiB.
Each allocation group must contain a number of blocks that is a power of 2 multiple of the number of block descriptors in a dmap page.
If multiple files are growing, each allocates extents from a different allocation group if possible.
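
The size rules for allocation groups can be expressed as a small check. This is a sketch under my own reading of those rules: I assume that an AG size must be the number of block descriptors in a dmap page (2¹³, as given in the block allocation map section) times a power of 2, and the function that picks a size is merely one way to satisfy the constraints, not necessarily what the JFS tools do. All names are hypothetical.

```python
# Sketch of the allocation group (AG) size constraints.
# Assumptions: 2**13 block descriptors per dmap page; names are illustrative.
DMAP_BLOCKS = 2**13      # block descriptors (bits) per dmap page
MIN_AG_BLOCKS = 8192     # minimum AG size in blocks (32MiB with 4KiB blocks)
MAX_AGS = 128            # maximum number of AGs in an aggregate


def is_valid_ag_size(ag_blocks: int) -> bool:
    """True if ag_blocks is DMAP_BLOCKS times a power of 2 and large enough."""
    if ag_blocks < MIN_AG_BLOCKS or ag_blocks % DMAP_BLOCKS != 0:
        return False
    multiple = ag_blocks // DMAP_BLOCKS
    return multiple & (multiple - 1) == 0    # power-of-2 test


def smallest_ag_size(aggregate_blocks: int) -> int:
    """Smallest valid AG size that covers the aggregate with at most MAX_AGS AGs."""
    ag_blocks = MIN_AG_BLOCKS
    while ag_blocks * MAX_AGS < aggregate_blocks:
        ag_blocks *= 2                       # stays a power-of-2 multiple
    return ag_blocks
```

With these constraints an aggregate of up to 128 × 8192 blocks (4GiB) can use the minimum AG size, and every doubling of the aggregate size beyond that doubles the AG size.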

Aggregate inode table

The aggregate inode table is an inode allocation map for the inodes that are used internally by the aggregate, and are not user visible (that is, are not part of any fileset). The inodes defined in the table are:

Number 0 is reserved.
Number 1 is the aggregate inode table itself.
Number 2 is the block allocation map file.
Number 3 is the inline log file.
Number 4 is the bad blocks file.
Number 16 is the fileset root file.

Since the aggregate inode table file refers to itself, the first extent of its inode allocation map has a well known constant address (just after the superblock).
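
For quick reference, the reserved aggregate inode numbers listed above can be written as an enumeration. This is just a restatement of the list as a sketch; the identifier names are mine and do not correspond to names in the JFS sources.

```python
# Sketch: the aggregate inode numbers from the list above as an enumeration.
# The member names are illustrative, not the identifiers used by JFS.
from enum import IntEnum


class AggregateInode(IntEnum):
    RESERVED = 0
    AGGREGATE_INODE_TABLE = 1    # the table describes itself
    BLOCK_ALLOCATION_MAP = 2
    INLINE_LOG = 3
    BAD_BLOCKS = 4
    FILESET_ROOT = 16


def describe(number: int) -> str:
    """Name of a listed aggregate inode, or "UNLISTED" for any other number."""
    try:
        return AggregateInode(number).name
    except ValueError:
        return "UNLISTED"
```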

Block allocation map

The block allocation map, also called bmap, is a file (not a B+-tree, despite being called map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:

Two arrays of 2¹³ bits where each bit corresponds to a block of the aggregate, and the bit is 1 if the block is in use. Because of the limit of three levels of dmap control pages, there can be at most 2³⁰ dmap pages, and thus at most 2⁴³ blocks in an aggregate.
Some metadata, including a buddy tree that defines a buddy system of the free and allocated blocks. The buddy tree also extends upwards in the dmap control pages.
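
The limits given for dmap pages can be verified with a short calculation. This is a sketch assuming 4096 byte blocks; the mapping from a block number to a dmap page assumes that blocks are assigned to dmap pages densely and in order, which is my assumption and is not stated in the description. The names are illustrative.

```python
# Sketch: capacity arithmetic for the block allocation map.
# Assumptions: 4096-byte blocks; blocks map to dmap pages densely, in order.
from typing import Tuple

BLOCK_SIZE = 4096           # bytes per aggregate block
BLOCKS_PER_DMAP = 2**13     # one bit per block in each dmap bit array
MAX_DMAP_PAGES = 2**30      # limit from three levels of dmap control pages


def max_aggregate_blocks() -> int:
    """Maximum number of blocks that the dmap pages can describe."""
    return MAX_DMAP_PAGES * BLOCKS_PER_DMAP


def dmap_for_block(block_number: int) -> Tuple[int, int]:
    """Return (dmap page index, bit index within that page) for a block."""
    return divmod(block_number, BLOCKS_PER_DMAP)


if __name__ == "__main__":
    blocks = max_aggregate_blocks()
    print(f"2**{blocks.bit_length() - 1} blocks, "
          f"{blocks * BLOCK_SIZE // 2**50} PiB")
```

So 2³⁰ dmap pages of 2¹³ bits each give 2⁴³ blocks, which with 4096 byte blocks is 32PiB. Each of the two bit arrays takes 1KiB of the 4KiB dmap page.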

The block allocation map contains information that is redundant with that of inode allocation maps, so it can be fully reconstructed, but only with a full scan of the aggregate and fileset inode tables.

Inline log

A sequence of blocks towards the end of an aggregate that is used to record intended modifications to aggregate or fileset metadata.

Bad blocks

This is a file whose extents cover all the bad blocks discovered by jfs_fsck if any.

Inode allocation maps

Inode allocation map

An inode allocation map is the file body of an inode table file, not a map. This file body contains as the first 4KiB block a control page called dinomap, and after that a number of extents called inode allocation groups.
The dinomap contains:

The AG free inode lists array.
The AG free inode extents lists array.
The IAG free list.
The IAG free next.

which segment the information held in the inode allocation map by allocation group.

AG free inode lists array

The AG free inode lists array contains a list header for each AG. Each list threads together all the IAGs in that AG that have some free inode entries.

AG free inode extents lists array

The AG free inode extents lists array contains a list header for each AG, and each list threads together all the IAGs in an AG that have some free inode extents.

IAG free list

The IAG free list array contains a list header for each AG, and each list contains the number of those IAGs in the AG whose inodes are all free.

IAG free next

The IAG free next is the number of the next IAG to append (if required) to an inode allocation map, or equivalently the number of IAGs in an inode allocation map plus 1.

Inode allocation group

An inode allocation group, also called IAG, is a 4KiB block that describes up to 128 inode table extents, for a total of up to 4096 inode table entries.
An inode allocation group can be in any allocation group, but all the inode table extents it describes must be in the same allocation group as the first one, unlike the extents of a general purpose file body, which can be in any allocation group; as soon as its first inode table extent is allocated in an allocation group, the inode allocation group is tied to it, until all such extents are freed.
Once allocated, inode allocation groups are never freed, though their inode table extents may be.

Inode table extent

Inode table extents are pointed to by inode allocation groups, and each must be 16KiB in length, and contains 32 inode table entries.
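
The sizes above fit together: 32 inodes of 512 bytes make a 16KiB extent, and 128 extents make 4096 inodes per IAG. The sketch below checks this and shows how an inode number could be split into an IAG, an extent within the IAG and a slot within the extent. It assumes that inode numbers are assigned densely in IAG and extent order, which this summary does not state; the names are hypothetical.

```python
# Sketch: inode table arithmetic and a possible inode number decomposition.
# Assumption: inode numbers run densely through IAGs and their extents.
from typing import Tuple

INODE_SIZE = 512            # bytes per inode
INODES_PER_EXTENT = 32      # inode table entries per inode table extent
EXTENTS_PER_IAG = 128       # inode table extents described by one IAG
INODES_PER_IAG = INODES_PER_EXTENT * EXTENTS_PER_IAG


def locate_inode(inode_number: int) -> Tuple[int, int, int]:
    """Return (IAG index, extent index in the IAG, slot in the extent)."""
    iag, within_iag = divmod(inode_number, INODES_PER_IAG)
    extent, slot = divmod(within_iag, INODES_PER_EXTENT)
    return iag, extent, slot


def slot_byte_offset(slot: int) -> int:
    """Byte offset of an inode within its 16KiB inode table extent."""
    return slot * INODE_SIZE
```

For example, under these assumptions inode number 4129 would be slot 1 of extent 1 of IAG 1, at byte offset 512 within that extent.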

Filesets

Fileset

A fileset is a collection of named inodes. Filesets are defined as and by a fileset inode table, which is an inode allocation map file. It contains these inodes:

Number 0 is reserved.
Number 1 is a file containing extended fileset information.
Number 2 is a directory which is the root of the fileset naming tree.
Number 3 is a file containing the ACL for the fileset.
Number 4 and following are used for the other files or directories in the fileset, all of which must be reachable from the directory at number 2.

File

A file is an inode with an attached (optional) allocation map describing a file body that contains data; a particular case of a file is a symbolic link, where the data in the file is a path name.

Directory

A directory is an inode with a list of names and corresponding inode numbers; the list is either contained entirely within the inode if it is small, or is an attached directory map, containing dtree entries.
