Notes about Linux storage
This section is about known hints and issues with
various aspects of Linux storage. They can be mere
inconveniences or limitations, or severe performance problems.
- Most consumer disks or SSDs do not allow shortening the very
long period (often minutes) of recovery in case of disk
errors, which can cause unnecessary RAID and service failures.
- Disks that allow setting a shorter period of recovery in
case of disk errors require a tool that can issue SCT ERC
commands, typically smartctl, which supports them only in
smartmontools version 5.40 or newer.
- The Linux storage system has some error recovery durations
and retry counts that can be set shorter (and often should be,
but not shorter than the SCT ERC ones) by setting some kernel
parameters (a sketch of doing so follows this list):
  - the maximum time to wait for IO completion on NVMe devices;
  - the maximum number of times to redo commands on NVMe devices;
  - the timeout before a failed SCSI/SAS/SATA operation results
    in a reset of the whole host bus adapter (by default this
    is disabled);
  - the error handling timeout, which is the number of seconds
    to wait for a reply to the TEST UNIT READY and REQUEST SENSE
    SCSI commands after an error has been reported;
  - the error retry timeout, that is for how many seconds to
    attempt to redo a failed operation; usually it should be a
    multiple of the drive's own recovery period;
  - the error retry count, that is how many times to redo a
    failed operation (available only from kernel version 5.10).
- Mounting filetrees on flash SSD devices with the discard
option can result in occasional, relatively long delays,
because the related TRIM operations are issued synchronously.
- The sector size for CDs is 2KiB.
- The sector size for DVDs is 32KiB, but they are required
to simulate a 2KiB sector size, potentially triggering
read-modify-write cycles for smaller writes.
- In Linux 2.6.37 the barrier logic in the kernel was replaced
by better flush/FUA logic, which should increase the parallel
scheduling of IO operations.
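
A minimal sketch of applying the above, assuming a SATA member disk
named sda, illustrative timeout values, and the usual sysfs attribute
locations (their availability varies with kernel version and
transport); smartctl is the smartmontools tool that issues SCT ERC
commands. This is a sketch to adapt, not a recommendation of values.

    #!/usr/bin/env python3
    # Illustrative only: shorten drive and kernel error-recovery times so
    # a failing member is given up on quickly instead of stalling the set.
    import pathlib
    import subprocess

    DISK = "sda"               # placeholder member device
    ERC_DECISECONDS = 70       # 7 seconds, in units of 100 ms

    # Ask the drive itself to give up after 7 s; needs SCT ERC support
    # and smartctl from smartmontools 5.40 or newer.
    subprocess.run(
        ["smartctl", "-l", f"scterc,{ERC_DECISECONDS},{ERC_DECISECONDS}",
         f"/dev/{DISK}"],
        check=True,
    )

    def write_sysfs(path: str, value: str) -> None:
        p = pathlib.Path(path)
        if p.exists():         # attribute availability varies by kernel
            p.write_text(value)

    # SCSI command timeout and error-handling timeout, in seconds; keep
    # them longer than the SCT ERC period set above.
    write_sysfs(f"/sys/block/{DISK}/device/timeout", "30")
    write_sysfs(f"/sys/block/{DISK}/device/eh_timeout", "10")

    # NVMe equivalents are nvme_core module parameters, not per-device files.
    write_sysfs("/sys/module/nvme_core/parameters/io_timeout", "30")
    write_sysfs("/sys/module/nvme_core/parameters/max_retries", "5")
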
Version dependent hints for the mdadm command:
- Version 3.3 allows setting the data offset explicitly with
the --data-offset option (a sketch for checking the offsets
actually recorded in member superblocks follows this list).
- Version 3.2.5 automatically reduces the default 128MiB
data offset if this is required to fit the data in the
member devices.
- Version 3.2.4 introduces a default data offset of 128MiB
for version 1.1 and version 1.2 metadata.
- Version 3.1.2 changes the default data offset to 1MiB for
version 1.1 and 1.2 metadata.
- Version 3.1.2 changes the default metadata type to 1.2.
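
Since different mdadm versions may have recorded different data
offsets, it can help to note what each member's superblock actually
contains before any maintenance. A minimal sketch, assuming the
placeholder member partitions named below and relying on the
"Data Offset" and "Super Offset" lines that mdadm --examine prints
for 1.x metadata:

    #!/usr/bin/env python3
    # Print the data and superblock offsets recorded in each member's
    # MD superblock.
    import subprocess

    MEMBERS = ["/dev/sda1", "/dev/sdb1"]   # placeholder member partitions

    for member in MEMBERS:
        out = subprocess.run(
            ["mdadm", "--examine", member],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            if "Data Offset" in line or "Super Offset" in line:
                print(member, line.strip())
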
Version dependent hints for the MD kernel module:
- Kernel 5.14 has an improvement that reduces locking contention
for very high IOPS workloads, resulting in much lower CPU
usage and somewhat higher IOPS rates.
- Kernel 5.10 has a bug where the size of RAID sets is
recalculated erroneously; it is fixed in 5.10.1, and the
right size can be restored by using mdadm --grow.
- Kernel 3.10.3 has a bug that can cause hangs.
- In corner cases with
kernel versions 3.2.1 and 3.3
a bug can cause the MD superblock to become corrupted on
reboot. The symptoms are that the MD set is inactive, all or
most members seem to be spares, and mdadm --examine
applied to the spares shows MD superblocks without a valid
RAID level and number of devices in the MD set.
- In kernels released before 2013
fsync on a read-only MD device is broken.
- In kernel 2.6.37
WRITE_FUA support was added to MD.
- In kernel 2.6.33
MD got barrier support for all types of RAID.
- In kernel 2.6.28 MD switched handling of MD devices with
respect to partitioning:
In Linux kernels prior to version 2.6.28 there were two
distinctly different types of md devices that could be
created: one that could be partitioned using standard
partitioning tools and one that could not. Since 2.6.28
that distinction is no longer relevant as both types of
devices can be partitioned.
- In older kernels, on a read error MD does not try to read
from redundant data and write it back to heal the error.
- Kernels before 2.6.10 can only use version 0.90 metadata.
Version independent hints:
- Different versions of mdadm can write different member
geometries (in particular different data offsets) even with
the same metadata version, because the metadata version only
indicates the superblock offset and remains the same across
different data offsets. This is extremely dangerous when
re-creating an MD set over its existing members (for example
with mdadm --create to recover it), as the data may then be
expected at a different offset.
- If an MD RAID superblock is at the end of a partition that
is itself at the end of a partitioned block device, MD RAID
member autodiscovery methods can wrongly guess that the whole
disk is an MD RAID set member, as the superblock is then at
the end of the disk too.
- The type of superblock gets OR'ed with 1 when the MD
RAID set is being reshaped.
- The initial sync for RAID5 and RAID6 can be much faster if
the array is created with spares and missing drives (a
creation sketch follows this list):
raid6 resync (like raid5)
is optimised for an array that is already in sync. It
reads everything and checks the P and Q blocks. When it
finds P or Q that are wrong, it calculates the correct
value and goes back to write it out.
On a fresh drive, this will involve lots of writing
which means seeking back to write something. With a
larger stripe_cache, the writes can presumably be done
in larger slabs so there are fewer seeks.
You might get a better result by creating the array
with two missing devices and two spares. It will then
read the good devices completely linearly, and write the
spares completely linearly and so should get full
hardware speed with normal stripe_cache size.
For raid5, mdadm makes this arrangement automatically.
It doesn't for raid6.
- DM/LVM2 snapshot LVs can be very slow.
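
Following the advice quoted above, a minimal sketch of the "two
missing devices plus two spares" creation order for a six-device
RAID6; the array name, member names, and the absence of extra mdadm
options are assumptions to adapt (mdadm may also ask for
confirmation depending on what it finds on the devices):

    #!/usr/bin/env python3
    # Illustrative creation order that turns the initial RAID6 sync into
    # a plain linear rebuild onto the two devices added last.
    import subprocess

    MD = "/dev/md0"                                  # placeholder array name
    PRESENT = ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]
    ADDED_LATER = ["/dev/sde1", "/dev/sdf1"]         # become the spares

    # Create the set with two slots deliberately left "missing".
    subprocess.run(
        ["mdadm", "--create", MD, "--level=6",
         f"--raid-devices={len(PRESENT) + len(ADDED_LATER)}",
         *PRESENT, "missing", "missing"],
        check=True,
    )

    # Adding the remaining devices makes the initial sync a recovery onto
    # them: linear reads from the present members, linear writes to the
    # new ones.
    subprocess.run(["mdadm", "--add", MD, *ADDED_LATER], check=True)
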
Version dependent hints:
Version independent hints:
- With indexed disk-metadata the maximum size of a DRBD block
device is 4TiB (4096GiB), not 4GiB as the manual page says,
given the 128MiB size of each indexed metadata area (see the
arithmetic below).
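
A back-of-the-envelope check of the 4TiB figure, assuming (this is an
assumption, not stated above) that the whole 128MiB indexed metadata
area is used as a dirty bitmap at DRBD's granularity of one bit per
4KiB of data, ignoring the small activity-log and superblock overheads:

    # 128MiB of bitmap, one bit per 4KiB of tracked data
    INDEX_BYTES = 128 * 1024 ** 2           # one indexed metadata area
    TRACKED_BITS = INDEX_BYTES * 8          # one bit per 4KiB extent
    COVERED_BYTES = TRACKED_BITS * 4 * 1024
    print(COVERED_BYTES / 1024 ** 4, "TiB") # -> 4.0, i.e. 4TiB, not 4GiB
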