Notes about Linux storage
This section is about known hints and issues with
various aspects of Linux storage. They can be mere
inconveniences or limitations, or severe performance problems.
- Most consumer disks or SSDs do not allow shortening the very
long period (often minutes) of recovery in case of disk
errors, which can cause unnecessary RAID and service failures.
- Disks that allow setting a shorter period of recovery in
case of disk errors require a tool that can issue SCT ERC
commands, typically smartctl, which supports them only in
smartmontools version 5.40 or newer.
- The Linux storage system has some error recovery durations
and retry counts that can be set shorter (and often should be,
but not shorter than the SCT ERC ones) by setting some kernel
parameters (a sketch of doing so follows this list):
  - the maximum time to wait for IO completion on NVMe devices;
  - the maximum number of times to redo commands on NVMe devices;
  - the timeout before a failed SCSI/SAS/SATA operation results
    in a reset of the whole host bus adapter (by default this
    is disabled);
  - the error handling timeout, which is the number of seconds
    to wait for a reply to the TEST UNIT READY and REQUEST SENSE
    SCSI commands after an error has been reported;
  - the error retry timeout, that is for how many seconds to
    attempt to redo a failed operation; usually it should be a
    multiple of the drive's own recovery period;
  - the error retry count, that is how many times to redo a
    failed operation (available only from kernel version 5.10).
- Mounting filetrees on flash SSD devices with the discard
option can result in occasional, relatively long delays,
because the related TRIM operations are issued synchronously.
- The sector size for CDs is 2KiB.
- The sector size for DVDs is 32KiB, but they are required
to simulate a 2KiB sector size, potentially triggering
read-modify-write cycles for smaller writes.
- In Linux 2.6.37 the barrier logic in the kernel was replaced
by better flush/FUA logic, which should increase the parallel
scheduling of IO operations.
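
A minimal sketch of applying the above, assuming a SATA member disk
named sda, illustrative timeout values, and the usual sysfs attribute
locations (their availability varies with kernel version and
transport); smartctl is the smartmontools tool that issues SCT ERC
commands. This is a sketch to adapt, not a recommendation of values.

    #!/usr/bin/env python3
    # Illustrative only: shorten drive and kernel error-recovery times so
    # a failing member is given up on quickly instead of stalling the set.
    import pathlib
    import subprocess

    DISK = "sda"               # placeholder member device
    ERC_DECISECONDS = 70       # 7 seconds, in units of 100 ms

    # Ask the drive itself to give up after 7 s; needs SCT ERC support
    # and smartctl from smartmontools 5.40 or newer.
    subprocess.run(
        ["smartctl", "-l", f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}",
         f"/dev/{DISK}"],
        check=True,
    )

    def write_sysfs(path: str, value: str) -> None:
        p = pathlib.Path(path)
        if p.exists():         # attribute availability varies by kernel
            p.write_text(value)

    # SCSI command timeout and error-handling timeout, in seconds; keep
    # them longer than the SCT ERC period set above.
    write_sysfs(f"/sys/block/{DISK}/device/timeout", "30")
    write_sysfs(f"/sys/block/{DISK}/device/eh_timeout", "10")

    # NVMe equivalents are nvme_core module parameters, not per-device files.
    write_sysfs("/sys/module/nvme_core/parameters/io_timeout", "30")
    write_sysfs("/sys/module/nvme_core/parameters/max_retries", "5")
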
Version dependent hints for the mdadm command:
- Version 3.3 allows setting the data offset explicitly with
the --data-offset option (a sketch for checking the offsets
actually recorded in member superblocks follows this list).
- Version 3.2.5 automatically reduces the default 128MiB
data offset if this is required to fit the data in the
member devices.
- Version 3.2.4 introduces a default data offset of 128MiB
for version 1.1 and version 1.2 metadata.
- Version 3.1.2 changes the default data offset to 1MiB for
version 1.1 and 1.2 metadata.
- Version 3.1.2 changes the default metadata type to 1.2.
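
Since different mdadm versions may have recorded different data
offsets, it can help to note what each member's superblock actually
contains before any maintenance. A minimal sketch, assuming the
placeholder member partitions named below and relying on the
"Data Offset" and "Super Offset" lines that mdadm --examine prints
for 1.x metadata:

    #!/usr/bin/env python3
    # Print the data and superblock offsets recorded in each member's
    # MD superblock.
    import subprocess

    MEMBERS = ["/dev/sda1", "/dev/sdb1"]   # placeholder member partitions

    for member in MEMBERS:
        out = subprocess.run(
            ["mdadm", "--examine", member],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if "Data Offset" in line or "Super Offset" in line:
                print(member, line.strip())
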
Version dependent hints for the MD kernel module:
- Kernel 5.14 has an improvement that reduces locking contention
for very high IOPS workloads, resulting in much lower CPU
usage and somewhat higher IOPS rates.
- Kernel 5.10 has a bug where the size of RAID sets is
recalculated erroneously; it is fixed in 5.10.1, and the
right size can be restored by using mdadm --grow.
- Kernel 3.10.3 has a bug that can cause hangs.
- In corner cases with
kernel versions 3.2.1 and 3.3
a bug can cause the MD superblock to become corrupted on
reboot. The symptoms are that the MD set is inactive, all or
most members seem to be spares, and mdadm --examine
applied to the spares shows MD superblocks without a valid
RAID level and number of devices in the MD set.
- In kernels released before 2013
fsync on a read-only MD device is broken.
- In kernel 2.6.37
WRITE_FUA support was added to MD.
- In kernel 2.6.33
MD got barrier support for all types of RAID.
- In kernel 2.6.28 MD switched handling of MD devices with
respect to partitioning:
In Linux kernels prior to version 2.6.28 there were two
distinctly different types of md devices that could be
created: one that could be partitioned using standard
partitioning tools and one that could not. Since 2.6.28
that distinction is no longer relevant as both types of
devices can be partitioned.
- In older kernels, on a read error MD does not try to read
from redundant data and write it back to heal the error.
- Kernels before 2.6.10 can only use version 0.90 metadata.
Version independent hints:
- Different versions of mdadm can write different member
geometries (in particular different data offsets) even with
the same metadata version, because the metadata version only
indicates the superblock offset and remains the same across
different data offsets. This is extremely dangerous when
re-creating an MD set over its existing members (for example
with mdadm --create to recover it), as the data may then be
expected at a different offset.
- If an MD RAID superblock is at the end of a partition that
is itself at the end of a partitioned block device, MD RAID
member autodiscovery methods can wrongly guess that the whole
disk is an MD RAID set member, as the superblock is then at
the end of the disk too.
- The type of superblock gets OR'ed with 1 when the MD
RAID set is being reshaped.
- The initial sync for RAID5 and RAID6 can be much faster if
the array is created with spares and missing drives (a
creation sketch follows this list):
raid6 resync (like raid5)
is optimised for an array that is already in sync. It
reads everything and checks the P and Q blocks. When it
finds P or Q that are wrong, it calculates the correct
value and goes back to write it out.
On a fresh drive, this will involve lots of writing
which means seeking back to write something. With a
larger stripe_cache, the writes can presumably be done
in larger slabs so there are fewer seeks.
You might get a better result by creating the array
with two missing devices and two spares. It will then
read the good devices completely linearly, and write the
spares completely linearly and so should get full
hardware speed with normal stripe_cache size.
For raid5, mdadm makes this arrangement automatically.
It doesn't for raid6.
- DM/LVM2 snapshot LVs can be very slow.
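
Following the advice quoted above, a minimal sketch of the "two
missing devices plus two spares" creation order for a six-device
RAID6; the array name, member names, and the absence of extra mdadm
options are assumptions to adapt (mdadm may also ask for
confirmation depending on what it finds on the devices):

    #!/usr/bin/env python3
    # Illustrative creation order that turns the initial RAID6 sync into
    # a plain linear rebuild onto the two devices added last.
    import subprocess

    MD = "/dev/md0"                                  # placeholder array name
    PRESENT = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
    ADDED_LATER = ["/dev/sde1", "/dev/sdf1"]         # become the spares

    # Create the set with two slots deliberately left "missing".
    subprocess.run(
        ["mdadm", "--create", MD, "--level=6",
         f"--raid-devices={len(PRESENT) + len(ADDED_LATER)}",
         *PRESENT, "missing", "missing"],
        check=True,
    )

    # Adding the remaining devices makes the initial sync a recovery onto
    # them: linear reads from the present members, linear writes to the
    # new ones.
    subprocess.run(["mdadm", "--add", MD, *ADDED_LATER], check=True)
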
Version dependent hints:
Version independent hints:
- With indexed disk-metadata the maximum size of a DRBD block
device is 4TiB (4096GiB), not 4GiB as the manual page says,
given the 128MiB size of each indexed metadata area (see the
arithmetic below).
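
A back-of-the-envelope check of the 4TiB figure, assuming (this is an
assumption, not stated above) that the whole 128MiB indexed metadata
area is used as a dirty bitmap at DRBD's granularity of one bit per
4KiB of data, ignoring the small activity-log and superblock overheads:

    # 128MiB of bitmap, one bit per 4KiB of tracked data
    INDEX_BYTES = 128 * 1024 ** 2           # one indexed metadata area
    TRACKED_BITS = INDEX_BYTES * 8          # one bit per 4KiB extent
    COVERED_BYTES = TRACKED_BITS * 4 * 1024
    print(COVERED_BYTES / 1024 ** 4, "TiB") # -> 4.0, i.e. 4TiB, not 4GiB
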