Updated: 2021-02-03
Created: 2005-10-31
Older references are not entirely accurate, because filesystems in kernel 2.6 behave considerably better than in kernel 2.4, and filesystem maintainers have reacted to older unfavourable benchmarks by tuning their designs. The references below are therefore ordered most recent first.
References: the ext2 page, the ext3 page and the ext3 mailing list, the ext4 page, and the ext3 FAQ (2004-10-14).

Feature | ext3 | JFS | XFS
---|---|---|---
Block sizes | 1024-4096 | 4096 | 512-4096
Max fs size | 8TiB (2^43B) | 32PiB (2^55B) | 8EiB (2^63B); 16TiB (2^44B) on 32b systems
Max file size | 1TiB (2^40B) | 4PiB (2^52B) | 8EiB (2^63B); 16TiB (2^44B) on 32b systems
Max files/fs | 2^32 | 2^32 | 2^32
Max files/dir | 2^32 | 2^31 | 2^32
Max subdirs/dir | 2^15 | 2^16 | 2^32
Number of inodes | fixed | dynamic | dynamic
Indexed dirs | option | auto | auto
Small data in inodes | no | auto (xattrs, dirs) | auto (xattrs, extent maps)
fsck speed | slow | fast | fast
fsck space | ? | 32B per inode | 2GiB RAM per 1TiB + 200B per inode (half on 32b CPU)
Redundant metadata | yes | yes | no
Bad block handling | yes | mkfs only | no
Tunable commit interval | yes | no | metadata
Supports VFS lock | yes | yes | yes
Has own lock/snapshot | no | no | yes
Names | 8 bit | UTF-16 or 8 bit | 8 bit
noatime | yes | yes | yes
O_DIRECT | yes | yes | yes
barrier | yes | no | yes (and checks)
commit interval | yes | no | no
EA/ACLs | both | both | both
Quotas | both | both | both
DMAPI | no | patch | option
Case insensitive | no | mkfs only | mkfs only (since 2.6.28)
Supported by GRUB | yes | yes | mostly
Can grow | online | online only | online only
Can shrink | offline | no | no
Journals data | option | no | no
Journals what | blocks | operations | operations
Journal disabling | yes | yes | no
Journal size | fixed | fixed | grow/shrink
Resize journal | offline | maybe | offline
Journal on another partition | yes | yes | yes
Special features or misfeatures | In place convert from ext2; MS Windows drivers | Case insensitive option; low CPU usage; DCE DFS compatible; OS2 compatible | Real time (streaming) section; IRIX compatible; very large write behind; project (subtree) quotas; superblock on sector 0
This section is about known hints and issues with various aspects of common filesystems, ranging from mere inconveniences or limitations to severe performance problems.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7a6e59d719ef0ec9b3d765cba3ba98ee585cbde3

Release dependent:
If an inode size of 128 bytes is used or kept, there is no room for high-resolution timestamps, so multiple updates per second are not recorded. This can impact make processing and fsync.
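As a quick check, one can inspect the inode size of an existing filesystem and, when creating a new one, request larger inodes; the device name below is only an example and the commands need root:

```shell
# Show the inode size of an existing ext3 filesystem.
dumpe2fs -h /dev/sda1 | grep 'Inode size'

# When making a new filesystem, 256-byte inodes leave room for
# high-resolution timestamps and small in-inode extended attributes.
mkfs.ext3 -I 256 /dev/sda1
```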
Kernel version dependent hints:
The default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.
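A minimal sketch of checking and switching the scheduler at runtime, assuming a device named sda and a kernel that offers the deadline scheduler (both are assumptions; needs root):

```shell
# The active scheduler is shown in brackets.
cat /sys/block/sda/queue/scheduler

# Switch away from CFQ, which serializes much of XFS's parallel IO.
echo deadline > /sys/block/sda/queue/scheduler
```

The change is per-device and does not persist across reboots; a boot parameter like elevator=deadline makes it the default.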
Kernel version independent hints:
The inode64 mount option rotates new directories across AGs, and then attempts to allocate space for new files in the AG containing the directory. This is quite different from the default behaviour: if you create a bunch of files in the same directory, without inode64 XFS will scatter their extents all over the disk rather than trying to allocate them next to each other.
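For example, mounting with the option looks like this (device and mount point are assumptions):

```shell
# Allow inodes (and thus new files' data) to be placed in any AG,
# keeping files in the same directory close together on disk.
mount -o inode64 /dev/sdb1 /srv/data
```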
Each file being extended keeps its allocation group busy; if all allocation groups are in use to grow extents, writing can stall for all other files, and similarly if the files are in the same allocation group. Having more allocation groups typically improves multithreaded performance.
If the allocation group size is a multiple of the underlying RAID stripe, then the allocation groups (and their metadata) may end up on the same disks, preventing parallel IO across the stripe. If mkfs.xfs can discover the underlying RAID geometry it will warn about this with the message:

Warning: AG size is a multiple of stripe width. This can cause performance problems by aligning all AGs on the same disk. To avoid this, run mkfs with an AG size that is one stripe unit smaller, for example %llu.

The solution indicated is to manually specify an allocation group size that is not congruent with the stripe width, usually a bit smaller.
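A hypothetical mkfs.xfs invocation following that advice, assuming a RAID with a 256KiB stripe unit across 4 data disks (all geometry values here are illustrative, not a recommendation):

```shell
# su/sw describe the RAID geometry; agsize is set to one stripe unit
# less than 4GiB so that AG boundaries do not all land on the same disk.
mkfs.xfs -d su=256k,sw=4,agsize=$((4*1024*1024*1024 - 256*1024)) /dev/md0
```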
will disable XFS' write barrier support.
Btrfs hints are inside the Btrfs notes page.
Release dependent:
Stale file errors can be caused by a change in RHEL 6.3: sunrpc.tcp_max_slot_table_entries now allocates RPC slots dynamically up to the maximum (65536). Reverting to the previous limit of 128 recovered system stability.
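Reverting can be sketched as follows (the parameter name is as reported above for RHEL 6.3 kernels; needs root, and affects new mounts):

```shell
# Cap the dynamic RPC slot table at the older static limit.
sysctl -w sunrpc.tcp_max_slot_table_entries=128

# Make the cap persistent across reboots via a module option.
echo 'options sunrpc tcp_max_slot_table_entries=128' \
  > /etc/modprobe.d/sunrpc.conf
```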
prepare_krb5_rfc_cfx_buffer: not implemented
Summary of conditions for a working NFSv4 with Kerberos GSSAPI authentication and/or encryption:
The same realm name used for both.
echo -n N > /sys/module/nfs/parameters/nfs4_disable_idmapping
The salt should be normal unless using a kaserver, in which case it should be afs3, and v4 only if relying on a Kerberos4 KDC.
The enctype should be des-cbc-crc for kernel versions older than 2.6.35.
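Putting the conditions above together, a sketch with an MIT KDC might look like this; the hostname, realm, and keytab path are assumptions, and the DES enctype is only appropriate for pre-2.6.35 kernels:

```shell
# Create the NFS service principal for the server.
kadmin -q 'addprinc -randkey nfs/server.example.com@EXAMPLE.COM'

# Extract it into the server's keytab with the des-cbc-crc enctype
# and the normal salt, as required by older kernels.
kadmin -q 'ktadd -e des-cbc-crc:normal -k /etc/krb5.keytab nfs/server.example.com@EXAMPLE.COM'
```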
Some useful pages for using NFSv4 with Kerberos:
Version dependent:
Newer kernels support stronger enctypes than des-cbc-crc and des-cbc-md5, and they should be used because 56-bit DES encryption is quite easy to break.
Wed Dec 18 08:52:47 2013 SYNC_getCom: error receiving command
Wed Dec 18 08:52:47 2013 FSYNC_com: read failed; dropping connection (cnt=8568)
Wed Dec 18 08:52:47 2013 SYNC_getCom: error receiving command
Wed Dec 18 08:52:47 2013 FSYNC_com: read failed; dropping connection (cnt=8569)
mode c1ff
6: dev 7, inode 25811, length 0, type/mode c1ff
7: dev 7, inode 9532, length 0, type/mode c1ff
8: dev 7, inode 21498, length 0, type/mode c1ff
9: dev 905, inode 95580989, length 65, type/mode 8000
10: dev 905, inode 94903997, length 2048, type/mode 8000
11: dev 905, inode 97388566, length 8448, type/mode 8003
in the FileLog.
ulimit -n 1021 in the script that starts BOS. The best solution is to upgrade to the latest release, as 1.6.1 has this and other known issues.
Partitions up to and including OpenAFS version 1.4 must be on an ext2 or ext3 filesystem.
Version independent:
A volume is actually a subtree of AFS directories and files, and a partition that holds volumes is actually a subtree of some native operating system filesystem, whether the partition is on a fileserver or is the cache on a client. OpenAFS treats as a partition anything that is mounted under directories whose name begins with vicep in the system's root directory.
A filesystem image file can serve as a partition holding OpenAFS volumes, as long as it is mounted via a loop device.
The partition for AFS volumes on an OpenAFS fileserver does not need to be its own dedicated block device, and neither does the AFS cache filetree on an OpenAFS client, but out-of-space conditions caused by space in the filetree being less than that declared to OpenAFS may be handled badly. The /vicepAB partitions which are not mount points will however be ignored unless they contain a file called AlwaysAttach.
Cell names are case insensitive, but they are stored internally in upper case and printed in lower case. As a rule, by convention they should always be specified in lower case, as there are default mappings to case sensitive Kerberos realm names in all upper case and to case insensitive DNS domain names in all lower case.
dynroot, because it relies on libafscp, which does not handle synthetic roots.
Rx uses a window-style flow control algorithm similar to TCP's, but the maximum window size is much smaller, which limits performance on links with a large bandwidth-delay product. The protocol allows up to 256 outstanding packets, but most versions of OpenAFS limit that to 32 packets, with the exception of the YFS version, which allows the full 256 packets.
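A back-of-the-envelope throughput ceiling follows from window size times packet payload per round trip; the ~1400-byte payload and 10ms RTT below are assumptions for illustration:

```shell
# Rough Rx throughput bound: window (packets) * payload (bytes) / RTT.
rtt=0.01        # assumed round-trip time in seconds
payload=1400    # assumed usable bytes per Rx packet
for win in 32 256; do
  awk -v w="$win" -v p="$payload" -v r="$rtt" \
    'BEGIN { printf "%d packets: %.1f MB/s\n", w, w*p/r/1e6 }'
done
# 32 packets: 4.5 MB/s
# 256 packets: 35.8 MB/s
```

This makes concrete why the 32-packet limit dominates on long fat links while the full 256-packet window of the YFS version helps substantially.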
So, setting a UDP buffer of 8Mbytes from user space is _just_ enough to handle 4096 incoming RX packets on a standard ethernet. However, it doesn't give you enough overhead to handle pings and other management packets. 16Mbytes should be plenty providing that you don't
a) Dramatically increase the number of threads on your fileserver
b) Increase the RX window size
c) Increase the ethernet frame size of your network (what impact this has depends on the internals of your network card implementation)
d) Have a large number of 1.6.0 clients on your network.

To summarise, and to stress Dan's original point: if you're running with the fileserver default buffer size (64k, 16 packets), or with the standard Linux maximum buffer size (128k, 32 packets), you almost certainly don't have enough buffer space for a loaded fileserver.
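Following the 16MiB suggestion above, the kernel-side UDP buffer ceilings can be raised like this (needs root; the fileserver's own buffer request is then allowed up to this cap):

```shell
# Raise the maximum and default socket receive buffers to 16MiB so a
# loaded fileserver does not drop Rx packets for lack of buffer space.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216
```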
It is a good idea to put a read-only replica of a volume in the same partition as the read-write one, as that is essentially free because it does not require file copying. On release (updating read-only volumes to have the same content as a read-write volume) the read-only replica in the same partition gets updated very quickly, and then other read-only replicas get updated from it, reducing the latency of the release operation.
Putting replicas of the same volume on partitions on the same server is a design error, and there are checks against that, but some corner cases can be missed by the checks.
For quorum-related reasons the number of AFS db servers should be odd (1, 2).
These are pointers to some of the entries in my technical blog where filesystems are discussed:
- fsck times.
- ext2 for all my MS Windows filesystems except the boot one.
- ext3 with and without extended attributes, and ext3's new hash directory indices.
- fsck.
- The davtools package to visualize ext3 fragmentation.
- fsck takes more than one month, and some filesystems being VLDBs.
- ext2 for MS Windows.
- noatime.
- ext3 into something else.
- What "works" means for filesystems.
- What "works" means for file systems.
- The root filesystem.
This is a summary in my own words of this more detailed description of JFS data structures. But there is a much better PDF version of the same document, with inline illustrations, also available inside this RPM from SUSE.
ABNR, which describes an extent containing zero bytes only.
The indices are btrees, and the leaf extents are called xtrees (and contain an array of entries called xads) if they are for an allocation map, and dtrees if they are for a directory map.
jfs_fsck.
The bmap is a file (not a B+-tree, despite being called a map) divided into 4KiB pages. The first block is the bmap control page, and then there are up to three levels of dmap control pages that point to many dmap pages. Each dmap page contains:
jfs_fsck if any.
dinomap, and after that a number of extents called inode allocation groups.
The dinomap contains:
tied to it, until all such extents are freed.
dtree entries.