This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
Sun has been selling the Thumper (X4500) boxes in large numbers, and here is one of the first sightings:
The box presents 48 drives, split across 6 SATA controllers. So disks sda-sdh are on one controller, etc. In our configuration, I run a RAID5 MD array for each controller, then run LVM on top of these to form one large VolGroup. I found that it was easiest to setup ext3 with a max of 2TB partitions. So running on top of the massive LVM VolGroup are a handful of ext3 partitions, each mounted in the filesystem, but we're rewriting some software to utilize the multi-partition scheme.

My reckoning is that the above is remarkably inappropriate.
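For concreteness, the scheme being described amounts to something like the following (a sketch only; the device names, array names and sizes here are made up, and none of this is a recommendation):

mdadm --create /dev/md0 --level=5 --raid-devices=8 /dev/sd[a-h]   # one RAID5 per 8-drive controller
mdadm --create /dev/md1 --level=5 --raid-devices=8 /dev/sd[i-p]   # ... and so on up to md5
pvcreate /dev/md0 /dev/md1                                        # remaining arrays added likewise
vgcreate bigvg /dev/md0 /dev/md1                                  # one large VolGroup over the arrays
lvcreate -L 2T -n vol0 bigvg                                      # a handful of logical volumes of at most 2TB
mkfs.ext3 /dev/bigvg/vol0
mount /dev/bigvg/vol0 /srv/vol0                                   # each mounted somewhere in the tree

And here is another one: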
start w/ 3 drives in RAID5, and add drives as I run low on free space, eventually to a total of 14 drives (the max the case can fit). But when I add the 5th or 6th drive, I'd like to switch from RAID5 to RAID6 for the extra redundancy.
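For reference, this kind of growth is what the reshape support in mdadm is for; a minimal sketch, assuming a hypothetical array /dev/md0 and new drive /dev/sde, and noting that the RAID5-to-RAID6 conversion needs a reasonably recent kernel and mdadm:

mdadm /dev/md0 --add /dev/sde                     # add the new drive as a spare
mdadm --grow /dev/md0 --raid-devices=4            # reshape the RAID5 across 4 drives
mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
      --backup-file=/root/md0-reshape.backup      # later, with another spare added: convert to RAID6

Each of these reshapes rewrites essentially the whole array while it stays online, and can take many hours on large drives.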
Apart from the perversity, the questions above come without any hint as to the intended use and expected access patterns, without which it is difficult to offer topical comments, and that's typical too of many queries to any mailing list:

I'm also interested in hearing people's opinions about LVM / EVMS. I'm currently planning on just using RAID w/out the higher level volume management, as from my reading I don't think they're worth the performance penalty, [ ... ]
Such posters can evidently drive LVM and mdadm, but that does not mean that they understand the performance and reliability implications. If you cannot quote from memory from about a dozen research papers on storage and filesystems, usually you cannot assess those implications. If you think that this is elitism, go ahead and make your day.
Then there are the worries about fsck times. The entertainment value of the next example arises both because of the plan to use a filesystem as a simple database manager and because of the RAID1 with 5 drives, with its implication that a RAID1 with 1 drive has 1X redundancy. As to the use of a filesystem as a database, here are some numbers from a particularly glorious example:

I'm testing xfs for use in storing 100 million+ small files (roughly 4 to 10KB each) and some directories will contain tens of thousands of files. There will be a lot of random reading, and also some random writing, and very little deletion. The underlying disks use linux software RAID-1 managed by mdadm with 5X redundancy. E.g. 5 drives that completely mirror each other.
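In mdadm terms, the mirroring the poster describes would look roughly like this (a sketch only; device names are hypothetical):

mdadm --create /dev/md0 --level=1 --raid-devices=5 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1   # a 5-way mirror
mkfs.xfs /dev/md0                                         # the filesystem being tested
# every write goes to all 5 members; usable capacity is that of a single drive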
And here is a comment on the results of using a filesystem as a simple database:

I have a little script, the job of which is to create a lot of very small files (~1 million files, typically ~50-100 bytes each).
[ ... ]
It's a bit of a one-off (or twice, maybe) script, and currently due to finish in about 15 hours, hence why I don't want to spend too much effort on rebuilding the box. Would rather take the chance to maybe learn something useful about tuning...
With some reasons why the filesystem alternative is not such a good idea:

First, I have appended two little Perl scripts (each rather small): one creates a Berkeley DB database of K records of random length varying between I and J bytes, the second does N accesses at random in that database.
I have a 1.6GHz Athlon XP with 512MB of memory, and a relatively standard 80GB disc 7200RPM. The database is being created on a 70% full 8GB JFS filesystem which has been somewhat recently created:
----------------------------------------------------------------
$ time perl megamake.pl /var/tmp/db 1000000 50 100

real    6m28.947s
user    0m35.860s
sys     0m45.530s
----------------------------------------------------------------
$ ls -sd /var/tmp/db*
130604 /var/tmp/db
----------------------------------------------------------------
I am often flummoxed by the ability of people to dream up schemes like the above. In part because I am envious: it must be very liberating to be unconstrained by common sense, and have the courage to try and explore the vast space of syntactically valid combinations, even those that seem implausible to people like me chained by the yearning for pragmatism.
- The size of the tree will be around 1M filesystem blocks on most filesystems, whose block size usually defaults to 4KiB, for a total of around 4GiB, or can be set as low as 512B, for a total of around 0.5GiB.
- With 1,000,000 files and a fanout of 50, we need 20,000 directories above them, 400 above those and 8 above those (see the sketch after this list). So 3 directory opens/reads every time a file has to be accessed, in addition to opening and reading the file.
- Each file access will involve therefore four inode accesses and four filesystem block accesses, probably rather widely scattered. Depending on the size of the filesystem block and whether the inode is contiguous to the body of the file this can involve anything between 32KiB and 2KiB of logical IO per file access.
- It is likely that the logical IOs relating to the two top levels of the subtree (the 8 and the 400 directories) will be avoided by caching between 200KiB and 1.6MiB, but the other two levels, the 20,000 bottom directories and the 1,000,000 leaf files, are unlikely to be cached.
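The directory and IO arithmetic in the list above can be checked mechanically; a trivial shell sketch, using the numbers stated above:

files=1000000 fanout=50
l1=$((files / fanout)); l2=$((l1 / fanout)); l3=$((l2 / fanout))
echo "$l1 $l2 $l3"                       # 20000 400 8 directories per level
echo "$(( (4 + 4) * 4096 / 1024 ))KiB"   # 4 inodes + 4 blocks at 4KiB: 32KiB per file access
echo "$(( 4 * 512 / 1024 ))KiB"          # inodes contiguous with data, 512B blocks: 2KiB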
There are regular reports of fsck speed being an issue:
As a followup, a couple of years back when I was deploying U320 1TiB arrays at work, we filled each of them with ~800Gb of MP3s, forcibly powered down, and did fsck tests. ext3 was ~12 hours. reiserfs was ~6 hours. xfs was under 2 hours. XFS got used.
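The sort of test being described is easy enough to reproduce; a sketch, assuming a scratch volume /dev/vg0/scratch that has been filled with data and unmounted first:

time e2fsck -f -n /dev/vg0/scratch        # full forced read-only check of an ext3 filesystem
time reiserfsck --check /dev/vg0/scratch  # reiserfs check (asks for confirmation before running)
time xfs_repair -n /dev/vg0/scratch       # XFS no-modify check; fsck.xfs itself does nothing

And here is another report: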
Yesterday I had fun time repairing 1.5Tb ext3 partition, containing many millions of files. Of course it should have never happened - this was decent PowerEdge 2850 box with RAID volume, ECC memory and reliable CentOS 4.4 distribution but still it did. We had "journal failed" message in kernel log and filesystem needed to be checked and repaired even though it is journaling file system which should not need checks in normal use, even in case of power failures. Checking and repairing took many hours especially as automatic check on boot failed and had to be manually restarted.
That's for 1TB, 1.5TB, and 2-3TB filesystems, but currently many people would like to deploy 8TB filesystems (and some excessively brave chancers would like much larger filesystems):

> I'll definitely be considering that, as I already had to wait hours for
> fsck to run on some 2 to 3TB ext3 filesystems after crashes. I know it
> can be disabled, but I do feel better forcing a complete check after a
> system crash, especially if the filesystem had been mounted for very
> long, like a year or so, and heavily used.

The decision process for using ext3 on large volumes is simple: Can you accept downtimes measured in hours (or days) due to fsck? No - don't use ext3. There's no workaround for that. Do not ever ignore the need to run fsck periodically. It's a safe thing to do. You can remount the xfs volume as read-only and then run fsck on that - that's another thing to take into account when setting things up.
That is one of the reasons why chunked filesystems like Lustre are a good idea, even if a central synchronizing agent then becomes a common mode of failure (the even bigger problem is that it is not widely understood that distributed computation without a central synchronizing agent belongs to a fundamentally different class of models of computation from those equivalent to Turing machines).