Notes about Linux filesystems

Updated: 2021-02-03
Created: 2005-10-31

File system references (170414)

Older references are not quite accurate, because things in kernel 2.6 are considerably better than in kernel 2.4, and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. So the references below are ordered most recent first.

Descriptions (20210305)
Warnings: many of these benchmarks are designed somewhat naively, and some truly essential aspects of the context, like the elevator or the filesystem readahead, are not mentioned; benchmarks under Linux 2.6 can give very different results from those under Linux 2.4; SCSI and ATA/IDE disc drives have very, very different performance profiles, including sync reporting.
Online discussions
Warning: some of these discussions are listed here because I think that they are notably wrong. Some pointers are to single articles, some to threads.

File system features (120407)

Desktop filesystem features
Feature                      | ext3         | JFS                 | XFS
-----------------------------+--------------+---------------------+------------------------------------------
Block sizes                  | 1024-4096    | 4096                | 512-4096
Max fs size                  | 8TiB (2^43B) | 32PiB (2^55B)       | 8EiB (2^63B); 16TiB (2^44B) on 32b system
Max file size                | 1TiB (2^40B) | 4PiB (2^52B)        | 8EiB (2^63B); 16TiB (2^44B) on 32b system
Max files/fs                 | 2^32         | 2^32                | 2^32
Max files/dir                | 2^32         | 2^31                | 2^32
Max subdirs/dir              | 2^15         | 2^16                | 2^32
Number of inodes             | fixed        | dynamic             | dynamic
Indexed dirs                 | option       | auto                | auto
Small data in inodes         | no           | auto (xattrs, dirs) | auto (xattrs, extent maps)
fsck speed                   | slow         | fast                | fast
fsck space                   | ?            | 32B per inode       | 2GiB RAM per 1TiB + 200B per inode (half on 32b CPU)
Redundant metadata           | yes          | yes                 | no
Bad block handling           | yes          | mkfs only           | no
Tunable commit interval      | yes          | no                  | metadata
Supports VFS lock            | yes          | yes                 | yes
Has own lock/snapshot        | no           | no                  | yes
Names                        | 8 bit        | UTF-16 or 8 bit     | 8 bit
noatime                      | yes          | yes                 | yes
O_DIRECT                     | yes          | yes                 | yes
barrier                      | yes          | no                  | yes (and checks)
commit interval              | yes          | no                  | no
EA/ACLs                      | both         | both                | both
Quotas                       | both         | both                | both
DMAPI                        | no           | patch               | option
Case insensitive             | no           | mkfs only           | mkfs only (since 2.6.28)
Supported by GRUB            | yes          | yes                 | mostly
Can grow                     | online       | online only         | online only
Can shrink                   | offline      | no                  | no
Journals data                | option       | no                  | no
Journals what                | blocks       | operations          | operations
Journal disabling            | yes          | yes                 | no
Journal size                 | fixed        | fixed               | grow/shrink
Resize journal               | offline      | maybe               | offline
Journal on another partition | yes          | yes                 | yes
Special features or misfeatures:
  • ext3: In place convert from ext2. MS Windows drivers.
  • JFS: Case insensitive option. Low CPU usage. DCE DFS compatible. OS2 compatible.
  • XFS: Real time (streaming) section. IRIX compatible. Very large write behind. Project (subtree) quotas. Superblock on sector 0.
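
The size limits in the table are powers of two bytes (for example 8TiB is 2^43 bytes). As a quick sanity check, the following sketch converts an exponent into a binary-prefixed size. The names human, MAX_FS_SIZE and MAX_FILE_SIZE are hypothetical and only serve this illustration; the exponents are the ones listed in the table.

```python
# Binary unit prefixes, largest first, with the exponent of two each one stands for.
UNITS = [("EiB", 60), ("PiB", 50), ("TiB", 40), ("GiB", 30),
         ("MiB", 20), ("KiB", 10), ("B", 0)]


def human(exp: int) -> str:
    """Render 2**exp bytes using the largest binary prefix that fits."""
    for name, unit_exp in UNITS:
        if exp >= unit_exp:
            return f"{2 ** (exp - unit_exp)}{name}"
    raise ValueError("exponent must be non-negative")


# Exponents as listed in the feature table (illustrative names).
MAX_FS_SIZE = {"ext3": 43, "JFS": 55, "XFS": 63}
MAX_FILE_SIZE = {"ext3": 40, "JFS": 52, "XFS": 63}

if __name__ == "__main__":
    for fs in MAX_FS_SIZE:
        print(fs, human(MAX_FS_SIZE[fs]), human(MAX_FILE_SIZE[fs]))
```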

File system hints

This section is about known hints and issues with various aspects of common filesystems. The issues range from mere inconveniences or limitations to severe performance problems.

File system hints generic (20210419)

Release dependent:

File system hints for F2FS (20221102)

File system hints for JFS (121226)

File system hints for ext3 (120304)

File system hints for ext4 (120922)

File system hints for XFS (140925)

Kernel version dependent hints:

Kernel version independent hints:

File system hints for Btrfs

Inside the Btrfs notes page.

File system hints for ZFS (201213)

Release independent:

Release dependent:

File system hints for NFS (210203)

Release dependent:

Release dependent:

Summary of conditions for a working NFSv4 with Kerberos GSSAPI authentication and/or encryption:

Some useful pages for using NFSv4 with Kerberos:

File system hints for OpenAFS (130913)

Version dependent:

Version independent:

File system hints for Lustre (120307)

Some of my notes on filesystems (120307)

These are pointers to some of the entries in my technical blog where filesystems are discussed:

JFS structure summary (051031)

This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.

Basic entities
A partition is a container, and has merely a size and a sector size, also called a partition block size, which defines IO granularity (and is usually the same for all partitions on the physical medium); a partition only contains an aggregate.
An extent is a contiguous sequence of blocks, wholly contained in one allocation group. The maximum size of an extent is 2^24-1 blocks, or almost 64GiB. There are a few types of extents; one of them is ABNR, which describes an extent containing zero bytes only.
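
The "almost 64GiB" figure follows from a maximum extent length of 2^24-1 blocks and the 4096 byte aggregate block size mentioned in these notes. A minimal arithmetic check, with constant names that are mine and not taken from the JFS sources:

```python
AGGREGATE_BLOCK_SIZE = 4096       # bytes; the only block size currently allowed
MAX_EXTENT_BLOCKS = 2**24 - 1     # maximum extent length, in blocks
GIB = 2**30

# Largest extent in bytes: one block short of 64GiB.
max_extent_bytes = MAX_EXTENT_BLOCKS * AGGREGATE_BLOCK_SIZE

if __name__ == "__main__":
    print(max_extent_bytes, "bytes, or", max_extent_bytes / GIB, "GiB")
```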
A map is a collection of extents that contains a B+-tree index rooted in the first extent of the collection; for example it can be an index of extents for a file body, in which case it is an allocation map, or an index of inode names for a directory, in which case it is called a directory map; the extents in a map are described in the map itself. The root extent of the map is called btree and the leaf extents are called xtrees (and contain an array of entries called xads) if they are for an allocation map, and dtrees if they are for a directory map.
File body
A file body is a sequence of one or more extents, the extents being listed in an allocation map. The extents may be from different allocation groups.
An inode is a 512 byte descriptor for the attributes of a file or directory, and contains also the root of a file body's allocation map, or of a directory map.
An aggregate is about allocating space, and has a size and an aggregate block size, which defines the granularity of allocation of space to files, and currently must be 4096.
  • Aggregates have a primary and a backup superblock.
  • Aggregates contain one or more allocation groups.
  • Aggregates have a primary and a backup aggregate inode table, each of which must be exactly 32 inodes long.
  • Aggregates may contain one or more filesets, but currently only one is allowed.
  • Aggregates also have some space reserved for use by jfs_fsck.
Allocation group
An allocation group, also known as an AG, is merely a section of an aggregate. There is no data structure associated with an allocation group; all data structures belong either to the aggregate or to a fileset.
  • There can be up to 128 AGs in an aggregate, and each must be at least 8192 blocks or 32MiB.
  • Each allocation group must contain a number of blocks that is a power-of-2 multiple of the number of block descriptors in a dmap page.
  • If multiple files are growing, each allocates extents from a different allocation group if possible.
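
The constraints above (at most 128 AGs, each at least 8192 blocks and a power-of-2 multiple of the blocks described by one dmap page) can be turned into a small sizing sketch. This is only an illustration under those stated constraints: the function pick_ag_size and its "smallest size that fits" policy are my assumptions, and the real mkfs policy may well differ.

```python
BLOCKS_PER_DMAP = 2**13   # blocks described by one dmap page
MIN_AG_BLOCKS = 8192      # minimum AG size in blocks (32MiB with 4096 byte blocks)
MAX_AGS = 128             # maximum number of AGs in an aggregate


def ag_count(aggregate_blocks: int, ag_blocks: int) -> int:
    """Number of AGs needed to cover the aggregate (the last one may be partial)."""
    return -(-aggregate_blocks // ag_blocks)  # ceiling division


def pick_ag_size(aggregate_blocks: int) -> int:
    """Smallest AG size, in blocks, that satisfies the constraints.

    Hypothetical policy: start from the minimum size and double it until
    no more than MAX_AGS groups are needed, so the result is always a
    power-of-2 multiple of BLOCKS_PER_DMAP.
    """
    ag_blocks = max(MIN_AG_BLOCKS, BLOCKS_PER_DMAP)
    while ag_count(aggregate_blocks, ag_blocks) > MAX_AGS:
        ag_blocks *= 2
    return ag_blocks


if __name__ == "__main__":
    one_tib_in_blocks = 2**40 // 4096
    size = pick_ag_size(one_tib_in_blocks)
    print(size, "blocks per AG,", ag_count(one_tib_in_blocks, size), "AGs")
```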
Aggregate inode table
The aggregate inode table is an inode allocation map for the inodes that are used internally by the aggregate, and are not user visible (that is, are not part of any fileset). The inodes defined in the table are:
Since the aggregate inode table file refers to itself, the first extent of its inode allocation map has a well known constant address (just after the superblock).
Block allocation map
The block allocation map, also called bmap, is a file (not a B+-tree, despite being called map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
  • Two arrays of 2^13 bits where each bit corresponds to a block of the aggregate, and the bit is 1 if the block is in use. Because of the limit of three levels of dmap control pages, there can be at most 2^30 dmap pages, and thus at most 2^43 blocks in an aggregate.
  • Some metadata, including a buddy tree that defines a buddy system of the free and allocated blocks. The buddy tree also extends upwards in the dmap control pages.
The block allocation map contains information that is redundant with that of inode allocation maps, so it can be fully reconstructed, but only with a full scan of the aggregate and fileset inode tables.
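
To make the bmap arithmetic concrete, here is a sketch of how a block number could be mapped to a dmap page and a bit within it, given 2^13 bits per dmap page and at most 2^30 dmap pages. The uniform fan-out of 1024 per control level is my inference from 2^30 = 1024^3 and is not stated in the description; the function names are hypothetical.

```python
from typing import Tuple

BLOCKS_PER_DMAP = 2**13   # bits in each bitmap array of a dmap page
MAX_DMAP_PAGES = 2**30    # limit implied by three levels of control pages
FANOUT = 1024             # assumption: equal fan-out per level, 1024**3 == 2**30
MAX_BLOCKS = BLOCKS_PER_DMAP * MAX_DMAP_PAGES  # 2**43 blocks


def locate_block(block: int) -> Tuple[int, int]:
    """Return (dmap page index, bit index within that page) for a block."""
    if not 0 <= block < MAX_BLOCKS:
        raise ValueError("block number outside the aggregate limit")
    return divmod(block, BLOCKS_PER_DMAP)


def control_path(dmap_index: int) -> Tuple[int, int, int]:
    """Slot used at each of the three control levels, top level first.

    Relies on the assumed uniform fan-out; shown only to illustrate why
    three levels bound the number of dmap pages.
    """
    low = dmap_index % FANOUT
    mid = (dmap_index // FANOUT) % FANOUT
    top = dmap_index // (FANOUT * FANOUT)
    return (top, mid, low)


if __name__ == "__main__":
    page, bit = locate_block(123_456_789)
    print(page, bit, control_path(page))
```

With 4096 byte blocks, 2^43 blocks is 2^55 bytes, which agrees with the 32PiB maximum filesystem size listed for JFS in the feature table.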
Inline log
A sequence of blocks towards the end of an aggregate that is used to record intended modifications to aggregate or fileset metadata.
Bad blocks
This is a file whose extents cover all the bad blocks discovered by jfs_fsck if any.
Inode allocation maps
Inode allocation map
An inode allocation map is the file body of an inode table file, not a map. This file body contains as the first 4KiB block a control page called dinomap, and after that a number of extents called inode allocation groups.
The dinomap contains:
  • The AG free inode lists array.
  • The AG free inode extents lists array.
  • The IAG free list.
  • The IAG free next.
which segment the information held in the inode allocation map by allocation group.
AG free inode lists array
The AG free inode lists array contains a list header for each AG. Each list threads together all the IAGs in that AG that have some free inode entries.
AG free inode extents lists array
The AG free inode extents lists array contains a list header for each AG, and each list threads together all the IAGs in an AG that have some free inode extents.
IAG free list
The IAG free list array contains a list header for each AG, and each list contains the number of those IAGs in the AG whose inodes are all free.
IAG free next
The IAG free next is the number of the next IAG to append (if required) to an inode allocation map, or equivalently the number of IAGs in an inode allocation map plus 1.
Inode allocation group
An inode allocation group, also called IAG, is a 4KiB block that describes up to 128 inode table extents, for a total of up to 4096 inode table entries.
An inode allocation group can be in any allocation group, but all the inode table extents it describes must be in the same allocation group as the first one, unlike the extents of a general purpose file body, which can be in any allocation group; as soon as its first inode table extent is allocated in an allocation group, the inode allocation group is tied to it, until all such extents are freed.
Once allocated, inode allocation groups are never freed, although their inode table extents may be.
Inode table extent
Inode table extents are pointed to by inode allocation groups; each must be 16KiB in length and contains 32 inode table entries.
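
The figures above fit together: 32 inodes of 512 bytes make a 16KiB extent, and 128 such extents make the 4096 inodes of one IAG. The sketch below decomposes an inode number accordingly. It assumes that inode numbers are assigned densely, so that plain division yields the IAG and extent; that assumption and the name split_inode_number are mine, not taken from the description.

```python
from typing import Tuple

INODE_SIZE = 512            # bytes per inode
INODES_PER_EXTENT = 32      # inode table entries per inode table extent
EXTENTS_PER_IAG = 128       # inode table extents described by one IAG
INODES_PER_IAG = INODES_PER_EXTENT * EXTENTS_PER_IAG  # 4096


def split_inode_number(ino: int) -> Tuple[int, int, int]:
    """Return (IAG number, extent within the IAG, slot within the extent).

    Assumes dense numbering of inodes across IAGs and extents.
    """
    if ino < 0:
        raise ValueError("inode numbers are non-negative")
    iag, within_iag = divmod(ino, INODES_PER_IAG)
    extent, slot = divmod(within_iag, INODES_PER_EXTENT)
    return iag, extent, slot


if __name__ == "__main__":
    print(INODES_PER_EXTENT * INODE_SIZE // 1024, "KiB per inode table extent")
    print(split_inode_number(4129))
```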
A fileset is a collection of named inodes. Filesets are defined as and by a fileset inode table, which is an inode allocation map file. It contains these inodes:
  • Number 0 is reserved.
  • Number 1 is a file containing extended fileset information.
  • Number 2 is a directory which is the root of the fileset naming tree.
  • Number 3 is a file containing the ACL for the fileset.
  • Number 4 and following are used for the other files or directories in the fileset; all of them must be reachable from the directory at number 2.
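
The reserved fileset inode numbers listed above can be written down as a small enumeration. The identifiers are hypothetical names chosen for this sketch; only the numbers and their meanings come from the list.

```python
from enum import IntEnum


class FilesetInode(IntEnum):
    """Well known inode numbers in a fileset inode table (illustrative names)."""
    RESERVED = 0         # reserved
    EXTENDED_INFO = 1    # file with extended fileset information
    ROOT_DIRECTORY = 2   # root of the fileset naming tree
    ACL = 3              # file containing the ACL for the fileset


FIRST_ORDINARY_INODE = 4  # first number used for other files and directories


def is_ordinary(ino: int) -> bool:
    """True if the inode number belongs to an ordinary file or directory."""
    return ino >= FIRST_ORDINARY_INODE


if __name__ == "__main__":
    for ino in range(6):
        kind = "ordinary" if is_ordinary(ino) else FilesetInode(ino).name
        print(ino, kind)
```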
A file is an inode with an attached (optional) allocation map describing a file body that contains data; a particular case of a file is a symbolic link, where the data in the file is a path name.
A directory is an inode with a list of names and corresponding inode numbers; the list is either contained entirely within the inode if it is small, or is an attached directory map, containing dtree entries.