Computing notes 2017 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


170302 Thu: Some coarse speed tests with Btrfs etc. and small files

In the previous note about a simple speed test of several Linux filesystems star was used because unlike GNU tar it does fsync(2) before close(2) of a written file, and I added that for Btrfs I also ensured that metadata was (as per default) duplicated as per the dup profile, and these were challenging details.
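The difference between the two ways of writing a file can be sketched as below. This is only a minimal illustration in Python, not the code of star or GNU tar; the names write_file and copy_small_files are hypothetical, and the timings it prints depend entirely on the filesystem and device hosting the temporary directory.

```python
import os
import tempfile
import time


def write_file(path, data, sync=True):
    """Write data to path; if sync is true, call fsync(2) before close(2)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if sync:
            # Block until the kernel reports that the file's data and
            # metadata have been transferred to the storage device.
            os.fsync(fd)
    finally:
        os.close(fd)


def copy_small_files(directory, count, sync):
    """Write `count` files of 1KiB each and return the elapsed seconds."""
    payload = b"x" * 1024
    start = time.perf_counter()
    for i in range(count):
        write_file(os.path.join(directory, f"f{i:05d}"), payload, sync=sync)
    return time.perf_counter() - start


if __name__ == "__main__":
    for sync in (True, False):
        with tempfile.TemporaryDirectory() as d:
            elapsed = copy_small_files(d, 200, sync)
            print(f"sync={sync}: {elapsed:.3f}s for 200 files")
```

Without the fsync the data may sit in the page cache for a while after close, which is why the two variants can differ so much in elapsed time.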

To demonstrate how challenging, I have done some further coarse tests on a similar system, copying from a filetree that is mostly a root filesystem (and so has many small files) to a Btrfs filesystem, with and without the star option -no-fsync, and with Btrfs metadata single and dup. The results are in order of increasing write rate with fsync:

                 elapsed    write rate  sys CPU
dup fsync        381m 28s    3.3MB/s    13m 14s
single fsync     293m 26s    4.2MB/s    13m 15s
dup no-fsync      20m 09s   61.7MB/s     3m 50s
single no-fsync   19m 58s   62.3MB/s     3m 41s

For comparison the same source and target and copy command with some other filesystems also in order of increasing write rate with fsync:

                 elapsed    write rate  sys CPU
XFS fsync        318m 49s    3.9MB/s     7m 55s
XFS no-fsync      21m 24s   58.2MB/s     4m 10s
F2FS fsync       239m 04s    5.2MB/s     9m 09s
F2FS no-fsync     21m 32s   57.8MB/s     5m 02s
NILFS2 fsync     118m 27s   10.5MB/s     7m 52s
NILFS2 no-fsync   23m 17s   53.5MB/s     5m 03s
JFS fsync         21m 45s   57.2MB/s     5m 11s
JFS no-fsync      19m 42s   63.2MB/s     4m 24s

Source filetree was XFS on a very fast flash SSD, so not a bottleneck.
Source filetree 70-71GiB (74-75GB) and 0.94M inodes (0.10M of which directories), of which 0.48M under 1KiB, 0.69M under 4KiB and 0.8M under 8KiB (the Btrfs allocation block size).
I occasionally observed the IO with blktrace and blkparse, and with fsync virtually all IO was synchronous, as fsync on a set of files of 1 block each is essentially the same as fsync on every block.
Because of its (nearly) copy-on-write nature Btrfs handles transactions with multiple fsyncs particularly well.
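The rates in the tables above follow from the elapsed times and the size of the source filetree. Here is a small sketch of that arithmetic. The size of 74.6GB is my assumption, picked within the stated 74-75GB range because it reproduces the listed rates closely; the function names are hypothetical.

```python
# Assumed size of the copied filetree in bytes (the notes only say 74-75GB).
ASSUMED_SIZE_BYTES = 74.6e9


def parse_elapsed(text):
    """Convert a string such as '381m 28s' into a number of seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def rate_mb_per_s(size_bytes, elapsed):
    """Average transfer rate in MB/s (1MB = 10**6 bytes)."""
    return size_bytes / parse_elapsed(elapsed) / 1e6


if __name__ == "__main__":
    btrfs_runs = {
        "dup fsync": "381m 28s",
        "single fsync": "293m 26s",
        "dup no-fsync": "20m 09s",
        "single no-fsync": "19m 58s",
    }
    for label, elapsed in btrfs_runs.items():
        rate = rate_mb_per_s(ASSUMED_SIZE_BYTES, elapsed)
        print(f"{label:16s} {elapsed:>9s} {rate:5.1f}MB/s")
    # How many times longer the copy takes when every file is fsync'ed.
    slowdown = parse_elapsed("381m 28s") / parse_elapsed("20m 09s")
    print(f"dup: fsync takes {slowdown:.1f} times as long as no-fsync")
```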

Obviously fsync per-file on most files is very expensive, and dup metadata is also quite expensive, but nowhere near as much as fsync. This is a very strong demonstration that the performance envelope of storage systems is rather anisotropic. It is also rather interesting that, given the same volume of data, on Btrfs IO with fsync costs more than 3 times the system CPU time of IO without (on the other filesystems the ratio is between 1.2 and 1.9 times). Some filesystem specific notes:

170228 Tue: Some coarse speed tests with various Linux filesystems

Having previously mentioned my favourite filesystems I have decided to do again, in a different form, a test of their speed that is rather informative despite being simplistic and coarse, similar to one I did a while ago, with some useful results. The test is:

It is coarse, it is simplistic, but it gives some useful upper bounds on how each filesystem does in a fairly optimal case. In particular the write test, involving as it does a fair bit of synchronous writing and seeking for metadata, is fairly harsh; even if, as the test involves writing to a fresh, empty filetree, pretty much an ideal condition, it does not account at all for fragmentation on rewrites and updates. The results, commented below, sorted by fastest, in two tables for writing and reading:

type         write elapsed  write rate  sys CPU
JFS          148m 01s        72.4MB/s   24m 36s
F2FS         170m 28s        62.9MB/s   26m 34s
OCFS2        183m 52s        58.3MB/s   36m 00s
XFS          198m 06s        54.1MB/s   23m 28s
NILFS2       224m 36s        47.7MB/s   32m 04s
ZFSonLinux   225m 09s        47.6MB/s   18m 37s
UDF          228m 47s        46.9MB/s   24m 32s
ReiserFS     236m 34s        45.3MB/s   37m 14s
Btrfs        252m 42s        42.4MB/s   21m 42s

type         read elapsed   read rate   sys CPU
F2FS         106m 25s       100.7MB/s   66m 57s
Btrfs        108m 59s        98.4MB/s   71m 25s
OCFS2        113m 42s        94.3MB/s   66m 39s
UDF          116m 35s        92.0MB/s   66m 54s
XFS          117m 10s        91.5MB/s   66m 03s
JFS          120m 18s        89.1MB/s   66m 38s
ZFSonLinux   125m 01s        85.8MB/s   23m 11s
ReiserFS     125m 08s        85.7MB/s   69m 52s
NILFS2       128m 05s        83.7MB/s   69m 41s

The system was otherwise quiescent.
I have watched the various tests with iostat, vmstat, and looking at graphs produced by collectd, displayed by kcollectd, and sometimes I have used blkstat blktrace; I have also occasionally used strace to look at the IO operations requested by star.
Having looked at actual behaviour, I am fairly sure that all the filesystems involved respected fsync semantics.
The source disk seemed at all times to not be the limiting factor for the copy, in particular as streaming reads are rather faster, as shown above, than writes.

The first comment on the numbers above is the obscene amount of system CPU time taken, especially for reading. That the system CPU time taken for reading is in most cases 2.5 times or more that for writing is also absurd. The test system has an 8 thread AMD 8270e CPU with a highly optimized microarchitecture, 8MiB of level 3 cache, and a 3.3GHz clock rate.

The system CPU time for most filesystem types is roughly the same, again especially for reading, which indicates that there is a common cause that is not filesystem specific. For F2FS the system CPU time for reading is more than 50% of the elapsed time, an extreme case. It is interesting to see that ZFSonLinux, which uses its own cache implementation, ARC, has a system CPU time for reading of roughly 1/3 that of the others.

That Linux block IO involves an obscene amount of system CPU time I had noticed already over 10 years ago, and that the issue has persisted so far is a continuing assessment of the Linux kernel developers in charge of developing the block IO subsystem.

Another comment that applies across all filesystems is that the range of speeds is not that wide: all of them had fairly adequate, reasonable speeds given the device. While there is a range of better to worse, this is to be expected from a coarse test like this, and a different ranking will apply to different workloads. What this coarse test says is that none of these filesystems is particularly bad on this, all of them are fairly good.

Another filesystem independent aspect is that the absolute values are much higher, at 6-7 times better, than those I reported only four years ago. My guess is that this is mostly because the previous test involved the Linux source tree, which contains a large number of very small files; but also because the hardware was an old system that I was no longer using in 2012, indeed I had not used since 2006.

As to the selection of filesystem types tested, the presence of F2FS, OCFS2, UDF, NILFS2 may seem surprising, as they are considered special-case or odd filesystems. Even if F2FS was targeted at flash storage, OCFS2 at shared-device clusters, UDF at DVDs and BDs, and NILFS2 at "continuous snapshotting", they are actually all general purpose, POSIX-compatible filesystems that work well on disk drives and with general purpose workloads. I have also added ZFSonLinux, even if I don't like it for various reasons, as a paragon. I have omitted a test of ext4 because I reckon that it is a largely pointless filesystem design, that exists only because of in-place upgradeability from ext3, which in turn was popular only because of in-place upgradeability from ext2, when the installed base of Linux was much smaller. Also OCFS2 has a design quite similar to that of the ext3 filesystem, and has some more interesting features.

Overall the winners, but not by large margins, from this test seem to be F2FS, JFS, XFS, OCFS2. Some filesystem specific notes, in alphabetical order:

170214 Tue: New Seagate disc drives and their declared duty cycles

Some time ago I mentioned an archival disk drive model from HGST which had a specific lifetime rating for total reads and writes, of 180TB per year as compared to the 550TB per year of a similar non-archival disk drive.

The recent series of IronWolf, IronWolf Pro and Enterprise Capacity disk drives from Seagate which are targeted largely at cold storage use have similar ratings:

It is interesting also that the lower capacity drives are rated for Nonrecoverable Read Errors per Bits Read, Max of 1 per 10E14 and the large ones for 1 in 1E15 (pretty much industry standard), because 180TB per year is roughly 10E13, and the drives probably are designed to last 5-10 years (they have 3-5 years warranties).
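To make the relationship between the workload rating and the error rating concrete, here is a rough sketch of the arithmetic. It assumes that 1TB is 10**12 bytes, that the two ratings mean one error per 10**14 and per 10**15 bits read, and, as a worst case, that the whole yearly workload consists of reads (the rating covers reads and writes together). The function name is hypothetical.

```python
TB = 10**12  # assumption: decimal terabytes, as usual in drive datasheets


def expected_read_errors(bytes_read, bits_per_error):
    """Upper bound on nonrecoverable read errors for a given volume of reads."""
    return bytes_read * 8 / bits_per_error


if __name__ == "__main__":
    yearly_workload = 180 * TB  # the 180TB per year rating, taken as all reads
    for label, bits_per_error in (("1 per 1e14 bits", 1e14),
                                  ("1 per 1e15 bits", 1e15)):
        per_year = expected_read_errors(yearly_workload, bits_per_error)
        print(f"{label}: at most about {per_year:.1f} errors per year")
```

Since the error rating is a maximum, these are upper bounds at full rated duty, not predictions.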

170205 Sun: A straightforward alternative to the setuid mechanism too

Having just illustrated a simple confinement mechanism for UNIX/Linux systems that uses the regular UNIX/Linux style permissions, I should add that the same mechanism can also replace in one simple unified mechanism the setuid protection domain switching of UNIX/Linux systems. The mechanism would be to add to each process, along with its effective id (user/group) what I would now call a preventive id with the following rules:

Note: there are some other details to take care of, like apposite rules for access to a process via a debugger. The logic of the mechanism is that it is safe to let a process operate under the preventive id of its executable, because the program logic of the executable is under the control of the owner of the executable, and that should not be subverted.

The mechanism above is not quite backwards compatible with the UNIX/Linux semantics because it makes changes in the effective or preventive ids depend on explicit process actions, but it can be revised to be backwards compatible with the following alternative rules:

Note: the implementation of either variant of the mechanism is trivial, and in particular adding preventive id fields to a process does not require backward incompatible changes as process attributes are not persistent.

The overall logic is that in the UNIX/Linux semantics for a process to work across two protection domains it must play between the user and group ids; but it is simpler and more general to have the two protection domains identified directly by two separate ids for the running process.

170204 Sat: Aggregate cost of AWS servers

BusinessWeek has an interesting article on some businesses that offer tools to minimize cloud costs and a particularly interesting example they make contains these figures:

Proofpoint rents about 2,000 servers from Amazon Web Services (AWS), [Amazon]’s cloud arm, and paid more than $10 million in 2016, double its 2015 outlay. “Amazon Web Services was the largest ungoverned item on the company’s budget,” Sutton says, meaning no one had to approve the cloud expenses.

$10m for 2,000 virtual machines means $5,000 per year per VM, or $25,000 over the 5-year period where a physical server would be depreciated. That buys a very nice physical server and 5 years of colocation including power and cooling and remote hands, plus a good margin of saving, with none of the inefficiencies or limitations of VMs; actually if one buys 2,000 physical servers and colocation one can expect substantial discounts over $25,000 for a single server over 5 years.
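The arithmetic above, spelled out as a tiny sketch using only the figures quoted from the article (the function name is hypothetical):

```python
def cost_per_vm(total_per_year, vm_count, years=1):
    """Average spend per VM over a number of years at a constant yearly total."""
    return total_per_year / vm_count * years


if __name__ == "__main__":
    yearly_total = 10_000_000  # "more than $10 million in 2016"
    vm_count = 2_000           # "about 2,000 servers"
    print(f"per VM per year: ${cost_per_vm(yearly_total, vm_count):,.0f}")
    print(f"per VM over 5 years: ${cost_per_vm(yearly_total, vm_count, 5):,.0f}")
```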

Note: those 2,000 servers are unlikely to have a large configuration, more likely to be small thin servers for purposes like running web front-ends.

The question is whether those who rent 2,000 VMs from AWS are mad. My impression is that they are more foolish than mad, and the key words in the story above are no one had to approve the cloud expenses. The point is not just lack of approval but that cloud VMs became the default option, the path of least resistance for every project inside the company.
The company probably started with something like 20 VMs to prototype their service to avoid investing in a fixed capital expense, and then since renting more VMs was easy and everybody did it, that grew by default to 2,000 with nobody really asking themselves for a long time whether a quick and easy option for starting with 20 systems is as sensible when having 2,000.

Cutting VM costs by 10-20% by improving capacity utilization is a start but fairly small compared to rolling their own.

I have written before that cloud storage is also very expensive, and cloud systems also seem to be. Cloud services seem to me premium products for those who love convenience more than price and/or have highly variable workloads, or those who need a built-in content distribution network. Probably small startups are like that, but eventually they start growing slowly or at least predictably, and keep using cloud services by habit.

170202 Thu: A straightforward alternative to confinement mechanisms

There are two problems in access control, read-up and write-down, and two techniques (access lists and capabilities). The regular UNIX access control aims to solve the read-up problem using an abbreviated form of access control lists, and POSIX added extended access control lists.

Preventing read-up with access control lists is a solution for preventing unintended access to resources by users, but does not prevent unintended access to resources by programs, or more precisely by the processes running those programs, because a process running with the user's id can access any resource belonging to that user, and potentially transfer the content of the resource to third parties such as the program's author, or someone who has managed to hack into the process running that program. That is, it does not prevent write-down.

The typical solution to confinement is to use some form of container, that is to envelop a process running a program in some kind of isolation layer, that prevents it from accessing the resources belonging to the invoking user. The isolation layer can be usually:

There is a much simpler alternative (with a logic similar to distinct effective and real ids for inodes) that uses the regular UNIX/POSIX permissions and ACLs: to create a UNIX/POSIX user (and group) id per program, and then to allow access only if both the process owner's id and the program's id have access to a resource.

This is in effect what SELinux does in a convoluted way, and is fairly similar also to AppArmor profile files, which however suffer from the limitation of imposing policies to be shared by all users.

Instead allowing processes to be characterized by both the user id of the process owner, and the user id of the program (and similarly for group ids), would allow users to make use of regular permissions and ACL to tailor access by programs to their own resources, if they so wished.

Note: currently access is granted if the effective user id or the effective group id of a process owner have permission to access a resource. This would change to granting access if [the process effective user id and the program user id both have permission] or [the process effective group id and the program group id both have permission]. Plus some additional rules, for example that a program id of 0 has access to everything, and can only be set by the user with user id 0.

Note: of course multiple programs could share the same program user and group id, which perhaps should be really called foreign id or origin id.
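The access rule described in the first note above can be sketched as follows. This is a toy model, not kernel code: the Resource and Process classes, the representation of ACLs as dictionaries from ids to sets of permission letters, and all the ids in the example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    # Toy ACLs: user id or group id mapped to a set such as {"r", "w"}.
    user_acl: dict = field(default_factory=dict)
    group_acl: dict = field(default_factory=dict)


@dataclass
class Process:
    euid: int         # effective user id of the process owner
    egid: int         # effective group id of the process owner
    program_uid: int  # user id assigned to the program being run
    program_gid: int  # group id assigned to the program being run


def id_allowed(acl, ident, perm):
    """True if the ACL grants permission `perm` to id `ident`."""
    return perm in acl.get(ident, set())


def program_id_allowed(acl, ident, perm):
    """Like id_allowed, but a program id of 0 has access to everything."""
    return ident == 0 or id_allowed(acl, ident, perm)


def current_rule(res, proc, perm):
    """Current semantics: the owner's user id or group id is enough."""
    return (id_allowed(res.user_acl, proc.euid, perm)
            or id_allowed(res.group_acl, proc.egid, perm))


def proposed_rule(res, proc, perm):
    """Proposed semantics: owner id and program id must both have access."""
    user_ok = (id_allowed(res.user_acl, proc.euid, perm)
               and program_id_allowed(res.user_acl, proc.program_uid, perm))
    group_ok = (id_allowed(res.group_acl, proc.egid, perm)
                and program_id_allowed(res.group_acl, proc.program_gid, perm))
    return user_ok or group_ok


if __name__ == "__main__":
    # User 1000 lets the editor program (id 2001) read and write a diary.
    diary = Resource(user_acl={1000: {"r", "w"}, 2001: {"r", "w"}})
    editor = Process(euid=1000, egid=1000, program_uid=2001, program_gid=2001)
    browser = Process(euid=1000, egid=1000, program_uid=2002, program_gid=2002)
    for name, proc in (("editor", editor), ("browser", browser)):
        print(name, current_rule(diary, proc, "r"), proposed_rule(diary, proc, "r"))
```

In this model the browser, although run by the same user, cannot read the diary under the proposed rule, because its program id has not been granted access.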

170129 Sun: The mainframe development problem and OpenMPI

At some point in order to boost the cost and lifestyles of their executives most IT technology companies try to move to higher margin market segments, which usually are those with the higher priced products. In the case of mainframes this meant an abandonment of lower priced market segments to minicomputer suppliers. This created a serious skills problem: to become a mainframe system administrator or programmer a mainframe was needed for learning, but mainframe hardware and operating systems were available only in large configurations at very high price levels, and therefore used only for production.

IBM was particularly affected by this as they really did not want to introduce minicomputers with a mainframe compatible hardware and operating system, to make sure customers locked-in to them would not be tempted to fall back on minicomputers, and the IBM lines of minicomputers were kept rigorously incompatible with and much more primitive than the IBM mainframe line. Their solution, which did not quite succeed, was to introduce PC-sized workstations with a compatible instruction set, to run on them a version of the mainframe operating system, and even to create plug-in cards for the IBM PC line, all to make sure that the learning systems could not be used as cheap production systems.

The IBM 5100 PC-sized mainframe emulator became an interesting detail in the story of time traveller John Titor.

The problem is more general: in order to learn to configure and program a system, one has to buy that system or a compatible one. Such a problem is currently less visible because most small or large systems are based on the same hardware architectures, the Intel IA32 or AMD64 ones, and one of two operating systems, MS-Windows or Linux, and a laptop or a desktop thus has the same runtime environment as a larger system.

Currently the problem happens in particular for large clusters for scientific computing, and it manifests particularly for highly parallel OpenMPI programs. In particular many users of large clusters develop their programs on their laptops and desktops, and these programs read data from local files using POSIX primitives, rather than using OpenMPI IO primitives. Thus the demand for highly parallel POSIX-like filesystems like Lustre that however are not quite suitable for highly parallel situations.

The dominant issue of such programs is that the issues that arise with them, mostly synchronization and latency impacting speed, cannot be reproduced on a workstation, even if it can run the program with OpenMPI or other frameworks. Many of these programs cannot even be tested on a small cluster, because their issues arise only at grand scale, and may be even different on different clusters, as they may be specific to the performance envelope of the target. It is not a simple problem to solve, and is a problem that limits severely the usefulness of highly parallel programs on large clusters, as it limits considerably their ecosystem.

170128 Sat: An interesting blog post on namespaces

The bottom of this site's index page has a list of sites and blogs similar to this in containing opinions about computer technology, mostly related to Linux and system and network engineering and programming, and I have recently discovered a blog by the engineers and programmers offering their services via one of the main project based contract work sites.

The blog like every blog has a bit of a promotional role for the site and the contract workers it lists, but the technical content is not itself promotional, but has fairly reasonable and interesting contributions.

I have been particularly interested by a posting on using Linux based namespaces to achieve program and process confinement.

It is an interesting topic in part because it is less than wonderfully documented, and it can have surprising consequences.

The posting is a bit optimistic in arguing that using namespacing, it is possible to safely execute arbitrary or unknown programs on your server, as there are documented cases of programs breaking out (moderately easily) of containers and even virtual machines. But namespaces do make it more difficult to do bad things, and often raising the level of cost and difficulty achieves good-enough security.

Note: The difficulty with namespaces and isolation is that namespaces are quite complicated mechanisms that need changes to a lot of Linux code, and they are a somewhat forced retrofit into the logic of a POSIX-style system, while dependable security mechanisms need to be very simple to describe and code. Virtual machine systems are however even more complicated and error prone.

The posting discusses process id, network interface, and mount namespaces, giving simple illustrative examples of code to use them, and briefly discusses also user id, IPC and host-name namespaces. Perhaps user id namespaces would deserve a longer discussion.
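As a small complement (this is not code from the posting), on Linux the namespaces a process belongs to can be inspected through the symbolic links under /proc/PID/ns, whose targets have the form type:[inode]; two processes are in the same namespace of a given type when those targets match. A minimal sketch, with hypothetical function names:

```python
import os
import re

# Link targets under /proc/PID/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self"):
    """Map each entry under /proc/PID/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def same_namespace(pid_a, pid_b, name):
    """True if both processes share the namespace listed under `name`."""
    return list_namespaces(pid_a)[name] == list_namespaces(pid_b)[name]


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        for name, inode in list_namespaces().items():
            print(f"{name:20s} {inode}")
    else:
        print("no /proc/self/ns here: this sketch only works on Linux")
```

Reading the links of another user's processes usually requires privileges, so in practice this works for one's own processes or as root.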

The posting indeed can serve as a useful starting point for someone who is interested in knowing more about a complicated topic. It would be nice to see it complemented by another article on the history and rationale of the design of namespaces and related ideas, and maybe I'll write something related to that in this blog.

170124 Tue: A legitimate Unsolicited Commercial Email!

With great surprise I have received recently for the first time in a very, very long time a legitimate and thus non-spam unsolicited commercial email.

The reason why unsolicited commercial emails are usually considered spam is that they are as a rule mindless time wasting advertising, usually automated and impersonal, and come in large volume as a result. The email I received was unsolicited and commercial, but it was actually a reasonable business contact email specifically directed at me from an actual person who answered my reply.

Of course there are socially challenged people who regard any unsolicited attempt at contact as a violation (especially from the tax office I guess :-)), but unsolicited contact is actually pretty reasonable if done in small doses and for non-trivial personal or business reasons.

It is remarkable how rare they are and that's why I use my extremely effective anti-spam wildcard domain scheme.

170120 Fri: What is the "Internet"?

While chatting the question was raised of what is the Internet. From a technical point of view that is actually an interesting question with a fairly definite answer:

There are IPv4 or IPv6 internets that are distinct from the two Internets, but usually they adopt the IANA conventions and have some kind of gateway (usually it needs to do NAT) to the two Internets.

Note: these internets use the same IPv4 or IPv6 address ranges as the two Internets, but use them for different hosts. Conceivably they could also use the same port numbers for different services: as port 80 has been assigned by IANA for HTTP service, a separate internet could use port 399. But while this is possible I have never heard of an internet that uses assigned numbers different from the IANA ones, except for the root DNS servers.

But it is more common to have IPv4 or IPv6 internets that differ from the two Internets only in having a different set of DNS root name server addresses, but are otherwise part, at the transport and lower levels, of the two Internets.

There is a specific technical term to indicate the consequences of having different sets of DNS root name servers, naming zone. Usually the naming zones for internets directly connected to the two Internets overlap and extend with those of the two Internets, they just add (and sometimes redefine or hide) the domains of the Internets.

Note: Both the IPv4 and the IPv6 Internets share the same naming zone, in the sense that the IPv4 DNS root servers and the IPv6 DNS root servers serve the same root zone content by convention. This is not necessarily the case at deeper DNS hierarchy levels: it is a local convention whether a domain resolves to an IPv4 and an IPv6 address that are equivalent, as in being on the same interface and having the same service daemons bound to them.

170107 Sat: F2FS and Bcachefs

Two relatively new filesystem designs and implementations for Linux are F2FS and Bcachefs.

The latter is a personal project of the author of Bcache, a design to cache data from slow storage onto faster storage. It seems very promising, and it is one of the few with full checksumming, but it is not yet part of the default Linux sources, and work on it seems to be interrupted, even if the implementation of the main features seems finished and stable.

F2FS was initially targeted at flash storage devices, but is generally usable as a regular POSIX filesystem, and performs well as such. Its implementation is also among the smallest with around half or less the code size of XFS, Btrfs, OCFS2 or ext4:

   text    data     bss     dec     hex filename
 237952   32874     168  270994   42292 f2fs/f2fs.ko

Many congratulations to its main author Jaegeuk Kim, a random Korean engineer in the middle of huge corporation Samsung, for his work.

Since the work has been an official Samsung project, and F2FS is part of the default Linux sources, and is widely used on Android based cellphone and tablet devices, it is likely to be well tested and to have long term support.