Computing notes 2017 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

[file this blog page at: digg del.icio.us Technorati]

170610 Sat: How challenging is a goal of 18MB/s per TB of storage, and latency

I have previously mentioned the issue of IOPS-per-TB and the related requirement from CERN for disk storage systems to deliver at least 18MB/s of parallel read-write per TB (where at least means worst-case and parallel means interleaved, at least on a single disk) and I was recently astonished to hear someone comment that is pretty easily satisfied given that hard disks have a transfer rate of 150MB/s, when I mentioned that disk larger than 1TB were not likely to achieve that, especially 4-10TB archival disks.

Then I noticed that person was using a MacBook Pro with a very high transfer rate flash SSD with very high IOPS rates and I suspect that detached him from a keen awareness of the performance envelope of rotating disk drives.

Part of the performance envelope of typical contemporary disk drives is that in purely sequential transfers they can do up to 150MB/s on the outer cylinders and up to 80MB/s on the inner cylinders (flash SSDs don't have this difference), and that average random positioning times are around 12ms (flash SSDs have small fractions of a millisecond at least fom reading).

Simultaneous read-write rates of at least 18MB/s (not per TB) require at least one random positioning per transfer, and at 80MB/s it takes around 12ms to transfer 1MB; therefore reading-writing sequentially two parallel streams with 1MB blocks, which are huge, gives around 25ms per block, for a total of 40MB/s, and it takes a read/write transfer size of at least 320KB to achieve at least 18MB/s, in pretty much optimal conditions, and let's say 0.5MB transfers to be have some margin for variability.

It then takes 3MB transfers to deliver 60MB/s parallel read-write suitable for a 3TB disk, and disks of 4TB cannot deliver 80MB/s except in purely single stream stream mode, that is in practice no disk larger than 3TB can deliver 18MB/s parallel read-write.

Note: That is a 1TB disk capable of 80 random IOPS needs 0.5MB transfers to deliver 18MB/s of parallel read-writes. Put another way, with 64KB transfers a storage unit needs at least 400 random IOPS per TB to deliver 18MB/s of parallel read-writes per TB, and no single disk drive can do that. With 16KB transfers at least 1,700 random IOPS are needed.

Not many applications can afford to read or write with transfers as large as 0.5MB and 3MB. And it is not just the application that must use those transfer sizes: the data on disk must be contiguous for at least the length of the transfer size.

Large transfer sizes also have a very big downside: starting latency for every transfer is driven by the size of the previous transfer, and completion latency by that of the current transfer. A 0.5MB transfer take 18ms to complete, and on a busy 1TB disk that means each transfer not only has a completion latency of 18ms, but also starts at least 18ms after being issued. An IO latency of 18ms is pretty huge. On a 3TB disk with 3MB blocks completion latency goes up to 50ms.

The cause of large latency is that random positioning has a latency of 12ms, and to amortize the cost of that one has to choose large IO sizes, and those add to the latency as it takes even more time to transfer larger blocks. It is possible to mitigate the latency to completion of the current transfer by delivering the large transfers incrementally as a sequence of smaller blocks, but the latency to starting the next transfer still is the full size of the previous transfer.

Also transfer sizes of 0.5MB per operation on a 1TB disk are quite challenging also because on a busy system many IO operations, in particular metadata and logging operations, use much smaller transfer sizes, and in particular opening a file requires usually at least one random short transfer.

It depends on workload and in particular on average IO operation size and average contiguity on disk, but sustaining 18MB parallel read-write per TB of capacity, that requires contiguous transfers of at least 0.5MB, seem to me quite a challenge even on 1TB drives.

Note: the above contains some simple arithmetic, but some familiarity with watching actual disk based storage systems should give a very solid intuitive understanding that 18MB/s parallel worst-case read-write per TB is a pretty demanding goal. I suspect that so many people now have desktops and laptops with flash SSDs that intuitive understanding is harder to develop.

As previously noted the IOPS-per-TB issue is per TB of data actually present, rather than of available capacity. An approach that seems astute to some people is to show-off the IO rate and low cost of a storage system using large cheap-per-TB disk drives when it is still far from full, for example when each drive has less than 1TB of data on it, even if its capacity is 4-10TB, because that 1TB of actual data:

It is likely to be entirely on the outer tracks that deliver the top 150MB/s transfer rate.
It is likely to be entirely within a narrow range of tracks, effectively short-stroking random accesses so that they take rather less than 12ms on average.
Data is likely to be mostly contiguous as the free list was mostly contiguos on an empty disk.

In effect in the beginning 4-10TB drives are behaving as if they were particularly fast 1TB drives. But when they fill up their observed average speed will fall to 1/4 or less than that of a when they had only 1TB of data. I have seen that happen quite a few times on storage systems setup by astute people.

170601 Thu Copyright violation in the UK, employment terms

Recent news are that copyright violations of any extent are now treated like theft in the UK, having become a criminal instead of civil matter:

it's now possible for copyright holders to pursue criminal cases for an infringer of any size

This reminded me of being asked some time ago to comment on some employment contract terms that included vesting all copyright in any works created by the employee during that contract to the employer. My opinion as a non-lawyer is that was intended to mean that every single email, photo, letter, blog posts, made by that individual belonged solely to their employers. This included of course every personal one, including family photos, private letters, public blog posts, etc.; my opinion as a non-lawyer is that whether that was enforceable was more likely than not, but also that it would be extremely expensive for an individual to contest that in court. If it were enforceable two consequences might happen in my opinion:

The employer might have an enforceable right to obtain all private emails, photos, etc. of that employee on demand, those being the employer's property; the employee might have no right to retain a copy unless specifically authorized by their employer. Refusal or omission of any material might be contract and copyright violation I suspect.
Any distribution to third parties, including family members, of any such employer property, might be contract and copyright violation if done without employer permission. That is, the act of sending a personal email might be contract and copyright violation.

Now in the past such consequences in my opinion would be fairly small, except for the cost of a court case, because if the matter was civil the employer would have to prove and quantify damages, and I reckon that any claim to damages for giving an employee's relatives a photo of say their family would be laughed out of court. But I reckon that the employee could still be subject to summary dismissal for contract violation.

But my impression is that if copyright violation of any type is a criminal offense, like stealing a sandwich from a shop is theft regardless of the amount involved, then an employer making employer-unautorized copies of a private email she has written by sending that email to her parents and siblings might well be punished with a jail sentence.

If that were a consequence, employers would have not just have a way to terminate employment for breach on contract any employee who sent a personal email, photo, letter, published a blog post without authorization, but also to prosecute and perhaps to jail them. That would be a very powerful way to put pressure on employees. Any claim that such clauses would never be used in that way is ridiculous: lawyers are as a rule required to make full use of any legal advantage they can find, and they are not shy.

As to that, the same employement terms contained fairly extreme non-compete clauses, even for a low-end back-office clerical job, not an executive or research job. My impression as a non-lawyer is that in the UK they are not enforceable; but if the employer were determined to bring suit, it would be very expensive to defend against it. In some jurisdictions even extreme non-compete clauses are enforceable.

Update 170611: Indeed I have recently read an article reporting a case where they were enforced, probably to as an example to others:

How Noncompete Clauses Keep Workers Locked In

Restrictions once limited to executives are now spreading across the labor landscape — making it tougher for Americans to get a raise.

170520 Sat: Hard links: not for GNU/Linux GUI users?

As I was trying out some GUI file managers (notably KDE's Dolphin and Konqueror) to rename and move files inside a directory in which I had let too many file name entries accumulate I noticed that the option to hard link a file was missing. That is oddd as hard linking is a pretty fundamental UNIX operation and very useful to have different names refer to the same file, so I did a web search and discovered a relevant bug report and this comment in it:

I must say that I'm strongly against such a feature. The entry "Clone" (or whatever it's called) will be useful for only a few, but confuse ~99% of all users.

Now those who like to create hard links will say "Just add an option for it", but if we add options for every little thing that a few people consider useful (please believe me, there are *lots* of such things),

The comment reveals that the microsoftization of the Linux ecosystem continues, and how deep it has become.

Note: an irony here is that they allow symbolic linking, but even if symbolic links resemble the shortcuts of MS-Windows they are quite different in semantics, risking more confusion.

Sort of like expected the dired module in EMACS has hard-linking, and Midnight Commander has it too. Fortunately. But I liked the ability to rename files inline. I will use writable dired mode instead.

170519 Mon: Stuck laptop CPU fan and temperatures

After looking again at the graph showing that the CPU temperature of my laptop being strictly dependent on CPU load I checked my laptop's collectd graphs and that was confirmed. Which was highly suspicious as at some point the fan should start operating and cool the CPU.

So I did a minimal amount of cooling conduit cleanup: first I hit the relevant area with a rubber mallet to shake it a bit (dust tends to clump) and then I blew hard through it, and used a hoover to aspire it out. The result is quite amazing, ands it is that now without much exception the CPU temperature is fixed pretty much at 50C except during periods of very high load:

The difference is large, and I suspect that for some reason my laptop's CPU must have been stuck. Looking at the collectd graph that must have happened more than a year ago. Also I have cleaned the inflow dust filter grilles of my excellent NZXT H230 desktop case and that by itself has got the temperature of the hard disks in it to drop by on average 3C (from an already rather low level):

Effect on temperature of filter cleaning

170514 Mon: The NSA "hoard" and WannaCry

The recent story of the leaking of the NSA hoard of tools exploiting MS-Windows bugs and the resulting use of that for a ransom-ware project is fairly amusing as well as some comments on it.

My impression is that it is not such a big deal: I would expect every major black-hat group to have a long library of exploits, even packaged as kits some of them resulting from accidental bugs in various software parts, and some of them well designed deliberate bugs created by the people the black-hat teams quite likely have infiltrated in major software companies and groups. The NSA hoard is probably one of several they have, and likely one of the least valuable.

Note: it is not a surprise that someone at Microsoft complained that the NSA had not reported the bug but most likely several other black-hat groups knew of the same bug and nobody reported it either. Because the job of black-hat groups is to use the bugs, not get them fixed, and accidental bugs (if that was an accidental bug) are particularly convenient to keep in a list because the deliberate bugs are too precious to use except in difficult cases.

Therefore one particular exploit kit is not that important in the black-hat world. Also because basic exploit kits relying on accidental bugs are in general not that important, because they are meant to be used towards generic targets: they are in general of value only statistically, that is against generic masses of target systems, a certain percentage of which will have some exploitable bug. Because of their genericity they are useful for gaining access from outside to inside systems, that is from the Internet to some private system.

Therefore as in the current situation their usefulness is for implanting on the target systems ransom-ware or spam-bot modules, rather than steal information or money directly. Also they tend to be a bit too obvious, so involving a high risk of discovery.

What is more worrying for people in my line of work that involves protecting server systems, are the tools involved in targeted black-hat activities, against specific sites or people. These tools are invariably based on some form of access from the inside to the outside, from a private system that has been compromised to the general internet. The difficulty of course is how to compromise a private system, and that is why both the relevant techniques are used for targeted access and also in general can or should be used for targeted access.

Note: some over-entusiastic person kept asking me for pointers to exploit kits for accessing bank computers and taking money from them. That is a really stupid idea because outside-to-inside exploit kits are the wrong approach, and bank computers that handle money flows are a harder target than most even for targeted access tools.
That's why black-hat organizations use exploit kits for generic personal computers hoping for statistical effectiveness (even a small percentage of exploited systems can be profitable) as to opportunities for installation of ransom-ware or spam-ware. It is rather bank executives who misdirect large flows of money quite legally by doing dodgy deals and become rich from the related bonuses. Paraphrasing what someone wise said, robbing a bank is much harder than embezzling from it.

All these techniques are exact replicas of traditional spy techniques (nihil novum sub soli) based on planting something and someone inside, and the way to achieve that is to have someone or something outside that gets taken inside.

Note: for some examples of a discussion of what can be done to add deliberate bugs as to hardware design there is a nice presentation by Bunnie Huang, Impedance Matching Expectations Between RISC-V and the Open Hardware Community given on 2017-05-10 at 11:00am at the 6^th RISC-V workshop.

Edward Snowden has published a catalogue of techniques and hardware devices that the NSA had available several years ago, typically based on communication devices that are difficult to see and detect and are planted into the target systems. These are planted into the system before the system is delivered inside, or by someone who takes them from outside to inside. Among those planted outside before the system is taken inside are the software bugs planted by agents of a black-hat organization at the source, while working for a hardware or software company.

These techniques and hardware devices rely on a large budget and protection from a state entity to be effective, but that's at available to the NSA and many other black-hat groups, both government sponsored or independent. Given that I think that defending against a targeted operation by a group at that level is practically impossible for almost any private individual or commercial organization, and that it is already extremely difficult to detect such operations.

In practice the best choice is to avoid being the target of such a high-budget operation. Since these operations are risky and expensive, they are only justified against particularly valuable targets. Not being a valuable target is in general a very desirable situation because typically security budgets in profit (in a wide sense) oriented organizations are massively undersized, even at valuable targets, as they are costs to be incurred without the promise of an immediate profit, just the possibility of minimising the costs of an attack maybe a long while later. Sometimes I commiserate the system administrators in organizations that handle money. I prefer to stick to research data (which is usually worthless without the accompanying metadata, such as the details of the experimental conditions), or public/publishable data like web content, for example advertising content.

170513 Sat: Preventing Linux partition scanning

Some user on IRC reported a very inconvenient situation: for some reason a disk to be accessed from Linux had a malformed partition table that crashed the kernel code that attempted to parse it. It would have been very easy to clear the relevant area of the disk, but that could not be done because by default Linux attempts to read the partition table as soon as a disk is connected.

This was obviously one of the many cases of violation of sensible practices in Linux design due to an excessive desire for automagic, and I could relate to it as I had to maintain system which used unpartitioned disks and printed on every boot a lot of spurious error messages about malformed partition tables (without crashing at least).

It is easy to verify that in the Linux kernel the rescan_partitions function is invoked unconditionally by the __blkdev_get function, and therefore it is impossible to avoid an attempt to find and parse a partition table. Ten years ago a simple patch was proposed to avoid scanning for a partition table at boot time, but it was not adopted.

So I thought about a workaround that ought to work: to use the boot parameter that tells the kernel that a specific block device has a given partition table using the boot kernel parameter blkdevparts. Since the code in rescan_partitions stops looking for a volume label when the first match is found this ought to work. Even if in the check_part table, which is organized alphabetically, the cmdline_partition section follows several checks for some variants of the Acorn volume label type, which should not be mostly a problem.

The overall issue is that because of the ever deepening microsoftization of GNU/Linux culture improper designs are made to be as automagic as possible: in this case the improperty is that the kernel autodiscovers all possible devices, and then autoactivates them, including partition scanning, and then upcalls the appallingly misdesigned udev subsystem to ensure devices are mounted or started as soon as possible.

The proper UNIX-style logic would be for the default to be inverted, that is for the kernel to initiate discovery of devices only on user request, and then to activate discovered devices again only on user request, and to mount or start them again only on user request. With the ability for the system startup scripts to assume those user requests on boot, but with the ability for the user to override the default. But then the Linux kernel designers have even forgotten to specify an abstract device state machine, so it does not seem surprising that an improper activation logic is used.

170506 Sat: Web pages and laptop heat

So I have previously remarked that many web pages keep the CPU busy and that laptops and handhelds are limited by heat and now I am too a double victim: in the past couple of weeks as I have been looking at news and forum sites like The Guardian my laptop (which I have been using as my main system) has shut down abruptly a few times because of overheating, caused entirely by CPU load from free-running Javascript in web pages.

This a graph (using psensor) of the effect of closing a JavaScript tab containing a Disqus forum threads on CPU usage and temperature:

This is in part due to having a somewhat older CPU chip that does not throttle its clock when it gets hotter, but that in a sense would be cheating: a nominally fast CPU that almost transparently becomes slow when it runs at its rated speed.

There is no workaround: like many I use browser extensions that disable JavaScript code (for example NoScript) but several sites insist on dynamic content generation or update via JavaScript, but once a site is given free access to a browser's JavaScript virtual machine they can take as much advantage as possible. Like many I think that executable content is a very bad idea for several reasons, but it is here to stay for a while. Let's hope that enough handheld users scald their hands or laps or suffer from extreme slowness because of executable content sites that they become less popular.

170427 Thu: NFS, GNOME and KDE issues with auto-mounting

I have long used auto-mounters like am-utils (once known as AMD) and recently the Linux-native autofs5 as a more UNIX-style alternative to complex and insecure schemes based on MS-Windows like udev-based mounting on attachment of a network device, as mounting on attempted access and is both safer and results on shorter mounting periods. Because of this I have written that script for autofs5 that turns /etc/fstab into an auto-mount map.

Since recently I upgrade my NFS server configuration I also decided to look at, being somewhat related, my auto-mounting situation, which has had two slightly annoying problems:

On the NFS server the nfs-kernel-server would not trigger anymore (it did up to some point) the auto-mounting of the exported filetrees when they were accessed, requiring static mounting.
On clients with GNOME or KDE once accessed an auto-mounted filetree stayed automounted, being never auto-unmounted.

The motivation in both cases is that I reckon it is a good practice to leave filetrees umounted as much as possible, and mount them only for the period where they are needed, because:

On a restart after a system crash filetrees that were unmounted are clean and no not need checking and recovery.
Each mount consumes system resouces both on the client and the server.
An unmounted filesystem cannot be damaged by accidental operations and does not clutter the main filetree.

My investigation about the NFS server Ganesha and auto-unmounting shows that Ganesha opens an exported filetree's top directory on startup, and this means it won't be mounted only if accessed by clients:

#  lsof -a -c ganesha.nfsd /fs/*/.
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ganesha.n 27646 root    5r   DIR   0,43      918  256 /mnt/export1/.
ganesha.n 27646 root    6r   DIR   0,47       84  256 /mnt/export2/.
ganesha.n 27646 root    7r   DIR    8,8    16384    5 /mnt/export3/.
ganesha.n 27646 root    8r   DIR   8,38     4096    5 /mnt/export5/.
ganesha.n 27646 root    9r   DIR   0,51      106  256 /mnt/export6/.
ganesha.n 27646 root   10r   DIR   0,55      156  256 /mnt/export7/.

The problem with both GNOME and KDE is that their open-file menus and their file managers are not designed to be UNIX-like, where there is a single tree of directories with user-invisible mount points, but they are designed to resemble MS-Windows (and MacOS X) in having a main filetree and a list of volumes with independent and separate filetrees. The other systems do this because on those systems by default when a mountable volume is attached to the system it is mounted statically until it is detached, and they don't have a UNIX-like mount-on-access behaviour.

Because of this they open the system table of mountpoints and read a list of those mountpoints and access them periodically to get their status, and this of course triggers automounting and constant automounmting. This has caused a number of complaints (1, 2, 3, 4), and fortunately these have resulted in some ready-made solutions for GNOME, and similar ones for KDE:

GNOME solution

The problem is that GNOME uses a filesystem layer and daemons called GVFS that by default mounts all mount-points listed in system configuration files. This can be disabled in three ways:

Per user with:

gsettings set org.gnome.desktop.media-handling \
  automount false
gsettings set org.gnome.desktop.media-handling \
  automount-open false

Per mount-point by adding the relevant /etc/fstab line the flag x-gvfs-hide.
For the whole system change the default by creating a file under /etc/dconf/db/local.d/ containing:
```
[org/gnome/desktop/media-handling]
automount=false
automount-open=false
```
and then running dconf update.

KDE solution

KDE used the kio system, but that is not quite involved, instead there is an automounter daemon per-user, and some common plugins mount by default any available mount-points. To disable all these for a user:

In the KDE Runner settings ensure that the Devices plugin is disabled.
In the KDE Dolphin Places panel mark as hidden all mountpoints.
Run kcmshell4 device_automounter_kcm and enable Only automatically mount removable media that has been manually mounted before and disable the other two options, and Forget Device those listed under Device Overrides (not really sure about this).
Run kcmshell4 kcmkded and disable Removable devices automounter.
Click on the System Tray and select the settings and disable the Device Notifier.

There is a patch to disable mount-point auto-mounting entirely (but only for those of type autofs) but is is not in the KDE version I am using.

Update 180314 Added to the KDE steps to disable the Device Notifier in the System Tray Settings.

170424 Mon: Btrfs and NFS service and NFS daemon Ganesha

So I am using Btrfs, despite it having some design and implementation flaws in its more advanced functionality, because it is fairly reliable as a single device filesystem with checksums and snapshots.

Because of its somewhat unusual redirect-on-write (a variant of COW) design and its many features it has somewhat surprising corner cases, and one of them is that direct IO writes does not update the block checksums (and direct IO reading does not verify them).

Note: this does not mean that checksums are missing when block is written by direct IO: as the above mailing list post says, they are computed when the write is initiated, but the block can still be modified by the application after that, and the already-computed checksum is not updated, because the application has no way to indicate, but initiating a second write, that the block has been updated. That is direct IO does not support stable writes which is problem in any filesystem, but with Btrfs can result in the block and checksum not matching.
If the block is modified before the write actually starts, the block written to persistent storage will be actually correct and consistent, but the checksum will be wrong; if the block is modified while the write is in progress, it is possible for the write to contain part original block and part updated block, and the block will be consistent and the checksum wrong. Note that in the case of block in chunks with profiles that have redundancy more complicated cases can arise: for example for a block with a raid1 profile the block checksum can be computed, and the block can be written from memory before it is modified to one copy, and after it is modified to the other.
On reads the checksums are indeed checked, but no error is returned to the application if they fail to verify, only a warning is printed to the system log. This is because it is possible for the block to be correct and the checksum wrong.

Update 170813: An earlier report that non-stable writes can be an issue with nfs-kernel-server was unfortunately optimistic because of an unclear error report. When nfs-kernel-server reads from Btrfs volumes the block checksums are read and checked but not verified, and in case of check failure a warning is printed by no error is returned. That means that potentially NFS clients get corrupted block without any error indication, despite the presence of checksums, as reported in this IRC conversation:

2017-02-20 14:32:28 <multicore> darkling: with direct-io csums errors don't result in I/O errors
2017-02-20 14:33:01 <multicore> so do just a nfs share from a btrfs and feed gargage to your clients :/
...
2017-02-20 14:51:55 <Walex> Do direct IO *writes* update checksums?
2017-02-20 14:52:12 <Ke> yes
...
2017-02-20 14:52:28 <Walex> Ke: Ahh so only read don't check them.
2017-02-20 14:53:19 <Ke> Walex: not sure, but the problem is that pages that are modeified during the write will not get updated checksums accordingly
2017-02-20 14:53:53 <multicore> Walex: during DIO reads there's a csum error logged but it doesn't result in EIO
2017-02-20 14:54:25 <Walex> Ke: so direct IO writes don't actually update checksums. My question was not about EIO, it was to update the "Gotchas" section
2017-02-20 14:54:57 <Walex> Is it fair to say that Direct IO reads and writes simply *ignore* checksums?
2017-02-20 14:55:03 <Ke> they do update them to match whatever they read when they read it
2017-02-20 14:55:32 <Ke> when writing that is
2017-02-20 14:55:47 <Ke> the page is not write protected during DIO
2017-02-20 14:55:57 <Ke> it can be modified at any time
...
2017-02-20 14:56:23 <Walex> Ke: does direct IO read verify checksums?
2017-02-20 14:56:28 <Ke> no idea
...
2017-02-20 14:56:46 <multicore> Walex: yes
2017-02-20 14:57:40 <Walex> multicore: so direct IO read and write do the checksum thing, but a block written with direct IO may be modified after the checksum has been computed.
2017-02-20 14:57:52 <Walex> modified in memory I mean.
2017-02-20 14:58:25 <Walex> so the issue is "stable" writes I guess.
2017-02-20 14:58:46 <multicore> Walex: i don't know
2017-02-20 14:58:47 <Walex> I am asking actually, not stating this.
2017-02-20 14:59:53 <Walex> multicore: because if the NFS kernel server does direct IO to Btrfs files, then what happens to checksums during direct IO is a rather big issue.
2017-02-20 15:05:20 <multicore> Walex: i was saved by logcheck because it mailed me about the csum error, had to delete the file from nfs server + client and recover from backups
...
2017-02-20 15:08:23 <multicore> Walex: i was told by someone that with raid1 + csum error the error is fixed but i haven't verified it...
2017-02-20 15:08:34 <multicore> Walex: ...with DIO...
...
2017-02-20 15:11:03 <multicore> this is the reason why i was asking: "<multicore> is there a patch around that turns DIO csum errors to EIO instead just logging the error?"

Note: it is not clear why the nfs-kernel-server checks but does not verify checksums. Perhaps when it writes there is a stable writes issue similar to that arising from direct IO. Also, in both the direct IO and nfs-kernel-server read cases, if a checksum fails on read, and there is enough redundancy to reconstruct valid block, it will be reconstructed. But may fail in the worst cases, where the block is valid but the checksum is wrong, or where there are two copies of the block, but the write was not stable on only one of them.

The obvious solution for NFS access to Btrfs volumes is to use the NFS daemon server Ganesha which runs like the SMB/CIFS daemon server Samba as a user process using ordinary IO, like many server daemons for other local and distributed filesystems. A kernel based server is going to have lower overheads, but for a file server the biggest costs are in storage and in network transfers, and running in user mode is comparatively a very modest cost.

Ganesha is something that I kept looking at, and years ago it was not quite finished and awkward to install and poorly documented. It is currently instead fairly polished, it is a standard package for Ubuntu, and newer versions are easily obtained via a fairly well maintained Ubuntu PPA.

It is still not very well documented, and in particular the existing example configurations are overly simplistic, as well as with some default paths not quite right for my Ubuntu 14 system, so here is my own example:

# /usr/share/doc/nfs-ganesha-doc/config_samples/config.txt.gz
# /usr/share/doc/nfs-ganesha-doc/config_samples/export.txt.gz
# https://github.com/nfs-ganesha/nfs-ganesha/wiki/Ganesha-config

NFS_CORE_PARAM
{
  # https://github.com/nfs-ganesha/nfs-ganesha/wiki/CoreParams

  Nb_Worker		=32;

  # Compare the port numbers with those in 
  # /etc/default/nfs-kernel-server (MNT_Port)
  # /etc/default/quota (Rquota_port)
  # /etc/modprobe.d/local.conf (NLM_Port)
  #   options lockd nlm_udpport=893 nlm_tcpport=893
  # /etc/sysctl.conf (NLM_Port)
  #   fs.nfs.nlm_tcpport=893
  #   fs.nfs.nlm_udpport=893

  NFS_Protocols		=4;
  NFS_Port		=2049;
  NFS_Program		=100003;
  MNT_Port		=892;
  MNT_Program		=100005;

  Enable_NLM		=true;
  NLM_Port		=893;
  NLM_Program		=100021;

  Enable_RQUOTA		=true;
  Rquota_Port		=894;
  Rquota_Program	=100011;

  Enable_TCP_keepalive	=false;
  Bind_addr		=0.0.0.0;
}

NFSV4
{
  DomainName		="WHATEVER";
  IdmapConf		="/etc/idmapd.conf";
}

NFS_KRB5
{
  Active_krb5		=true;
  # This could be "host" to recycle the host principal for NFS.
  PrincipalName		="nfs";
  KeytabPath		="/etc/krb5.keytab";
  CCacheDir		="/run/ganesha.nfs.kr5bcc";
}

EXPORT_DEFAULTS
{
  Protocols		=4;
  Transports		=TCP;

  Access_Type		=NONE;
  SecType		=krb5p;
  Squash		=Root_Squash;
}

EXPORT
{
  Access_Type		=RW;
  SecType		=krb5p;
  Squash		=Root_Squash;

  Export_ID		=1;
  FSAL			{ Name="XFS"; }
  # Must not have symlinks
  Path			="/var/data/pub";
  Pseudo		="/var/data/pub";

  PrefRead		=262144;
  PrefWrite		=262144;
  PrefReadDir		=262144;
}

Two of the interesting aspects of Ganesha is that it can serve not just NFSv3 and NFSv4, but also NFSv4.1, pNFS and even the 9p protocol and has specialized backends called FSAL for various type of underlying filesystems, notably CephFS and GlusterFS but also various other cluster filesystems, plus XFS and ZFS. The backends for the cluster filesystems give direct access to those filesystems without the need of a local client for them, which reduces considerably complications and overheads.

My experience with Ganesha has been so far very positive, the speed is good as expected, and it is rather easier to setup and configure than nfs-kernel-server in particular as to Kerberos authentication; also as to configuration, administration, monitoring and investigation, being a user-level daemon.

170417 Mon: Remote two factor authentication and passwords and passphrases

Remote two-factor authentication requires knowledge or possession of two authenticators, usually one is known and the other is possessed. The obvious example is a password and a mobile phone number that received a once-only validation code, or a password and a random one-time-password from a pre-generated list.

However in many cases there is a weaker form of two-factor authentication, where the authenticator unlocks some container holding the effective authenticator, because in that case access requires both knowledge of the authenticator and possession of the container. This case is common where the authenticator is the passphrase of some encrypted file containing the real authenticator, for example an SSH key passphrase and a the SSH private key file it encrypts.

But there is a more common case that should not be forgotten: possession of the computer itself, that is access can only be local and physical. This does not mean that the computer is only accessible locally, but that the authenticator only works locally.

Now these cases seem fairly obvious, but there is an important case where the distinction matters: when providing an authenticator based on knowledge and it can be observed by third parties, such as when a key-logger is present on the system, or a camera or microphone records the typing of the authenticator. In these cases the authenticator can be recorded. For organizations with resources putting a tiny key-logger in the target system or a tiny camera or a tiny microphone in the room containing can be pretty easy for regular offices and residences, but there is a far more common case: use of a system, such as a laptop, in public, such as on a train, bus, or in a coffee house or internet cafè, because surveillance cameras are extremely common.

Only explicit or implicit two-factor authentication helps against that. If that is a concern all remote accesses must involve two-factor authentication, and all one-factor accesses must be strictly local.

So for example on a UNIX/Linux systems passwd authentication should only be allowed for local log-in (getty etc.) and only key based authentication should be allowed for remote log-in (SSH etc.). Also, the local login password can be further augmented, in case loss of possession of the system is a concern, with an OTP system.

Note: Indeed for quite critical systems a site I know uses password plus OTP for local logins, and manually verified SSH keys for remote access.

In particular where is a concern with observation of passwords, Kerberos (and this includes MS-AD) passwords should not be allowed for remote log-in, because the password can be used then to get remote access entirely by itself; also kinit by the same argument should not be used with just a password.

Fortunately some provision has been made for two-factor authentication using Kerberos and smartcards/certificates (PKINIT) and tentative approaches with OTP (FAST, PAKE).

There is an alternative: to extract from Kerberos a suitable keytab and put it into an encrypted file. Then knowing the passphrase for the encrypted keytab is needed to use it with kinit, and possession of that passphrase and the keytab are both necessary. This option has some downsides, and should be carefully considered.

In theory all passwords and passphrases should be unique, but for primary login (whether local or remote) passwords that is quite tedious. The local login password and the passphrases used to encrypt local copies of the SSH keys (and of the Kerberos keytab) can be the same, as they all require possession of the system or encrypted file; the OTP and Kerberos passwords however need to be different, or else those who figure out the login passwords can use them to achieve local access when possession of the system has been obtained, or remote access via Kerberos.

Note: if local or remote login is achieved by an adversary, all encryption passphrases can be trivially obtained on first use, by installing a key-logger daemon.

170407 Thu: Unusual filesystem properties

I have recently learned that in Btrfs the number of links per directory (as returned by the stat(2) system call) is always 1, and this number has the special meaning that it is not known how many subdirectories a directory has, as a directory always has at least 2 sudirectories. Also the number of inodes maximum and free in a Btrfs filetree is always reported as 0 (as returned by the statfs(2) system call), which has the special meaning that the actual number is not known, while a filetree has always at least 1 inode for the local root directory.

These are significant changes in the semantics of UNIX-style filesystems, and other filesystem types have adopted them. It makes certainly easier to adopt filesystem implementations in which hard-links and directories are implement differently from the tradition in which a directory is a simple list of hard links which consist of link name and inode number.

The relevant code within the popular findutils collection and some other has been updated, but there is a lack of documentation of the special meaning of a link count of 1 for directories or 0 for maximum inodes and free inodes.

They are fairly significant changes, because they are obviously not backwards compatible: other filesystems with fluid statistics use the largest number, which is backwards compatible, becaue both number of links to a directory and number of free or user inodes are used as upper limits in loops that also rely on conditions.

170324 Fri: IPv6 reaches 14% of Google traffic, SixXS will close down

Three years ago IPv6 traffic was seen by Google as hitting 2% of total IP traffic and it is quite remarkable that today it hit 14.22%, and that is 14% of a rather bigger total. It is also remarkable that:

The growth of the percentage seems to have accelerating until 2015, while growth looks linear since then, with a gain of 4% per year.
The per-country distribution is far from uniform, for example Germany with 28%, Greece with 30%, USA with 30%, while UK, France, Canada, Japan, India, Malaysia, Finland, Norway, Portugal, Brazil are fairly close to the 14% average.
That Google sees 30% of IPv6 traffic in the USA, where IPv4 have been very abundantly allocated, and just 2.71% in China, where they are very scarce, is puzzling.
The graph from 2008 to 2012 shows that 1% was reached at the end of 2012, and that transition technologies like 6to4 and Teredo were a significant part of IPv6 traffic then, but have rapidly fallen to a minuscule share.

At 14% of total traffic IPv6 is now commercially relevant, in the sense that services need to be provided over IPv6, as IPv6 users cannot be ignored, and accordingly even mass market ISPs like Sky Broadband UK have been giving IPv6 connectivity as standard for several months and it is therefore sad but understandable that transition services like SixXS are being closed down:

Summary

SixXS will be sunset in H1 2017. All services will be turned down on 2017-06-06, after which the SixXS project will be retired. Users will no longer be able to use their IPv6 tunnels or subnets after this date, and are required to obtain IPv6 connectivity elsewhere, primarily with their Internet service provider.

My current ISP does not yet support IPv6, but it has a low delay to a fairly good 6to4 gateway, so I continue to use 6to4 with NAT.

170302 Thu: Some coarse speed tests with Btrfs etc. and small files

In the previous note about a simple speed test of several Linux filesystems star was used because unlike GNU tar it does fsync(2) before close(2) of a written file, and I added that for Btrfs I also ensured that metadata was (as per default) duplicated as per the dup profile, and these were challenging details.

To demonstrate how challenging I have done some further coarse tests on a similar system, copying from a mostly root filesystem (which has many small files) to a Btrfs filesystem, with and without the star option -no-fsync and with Btrfs metadata single and dup in order of increasing write rate with fsync:

Btrfs target	write time	write rate	write sys CPU
`dup` `fsync`	381m 28s	3.3MB/s	13m 14s
`single` `fsync`	293m 26s	4.2MB/s	13m 15s
`dup` no-`fsync`	20m 09s	61.7MB/s	3m 50s
`single` no-`fsync`	19m 58s	62.3MB/s	3m 41s

For comparison the same source and target and copy command with some other filesystems also in order of increasing write rate with fsync:

other target	write time	write rate	write sys CPU
XFS `fsync`	318m 49s	3.9MB/s	7m 55s
XFS no-`fsync`	21m 24s	58.2MB/s	4m 10s
F2FS `fsync`	239m 04s	5.2MB/s	9m 09s
F2FS no-`fsync`	21m 32s	57.8MB/s	5m 02s
NILFS2 `fsync`	118m 27s	10.5MB/s	7m 52s
NILFS2 no-`fsync`	23m 17s	53.5MB/s	5m 03s
JFS `fsync`	21m 45s	57.2MB/s	5m 11s
JFS no-`fsync`	19m 42s	63.2MB/s	4m 24s

Notes:
Source filetree was XFS on a very fast flash SSD, so not a bottleneck.
Source filetree 70-71GiB (74-75GB) and 0.94M inodes (0.10M of which directories), of which 0.48M under 1KiB, 0.69M under 4KiB and 0.8M under 8KiB (the Btrfs allocation block size).
Observed occasionally IO with blktrace and blkparse and with fsync virtually all IO was synchronous, as fsync on a set of files of 1 block is essentially the same as fsync on every block.
Because of its (nearly) copy-on-write nature Btrfs handles transactions with multiple fsyncs particularly well.

Obviously fsync per-file on most files is very expensive, and dup file metadata is also quite expensive, but nowhere as much as fsync. A very strong demonstration that the performance envelope of storage system is rather anisotropic. It is also ghastly interesting that given the same volume of data IO with fsync costs more than 3 times the system CPU time as without. Some filesystem specific notes:

F2FS at the end has allocated around 84GiB to store a 70-71GiB filetree, and this is probably due to internal fragmentation in writing less than fully utilized extents on fsync.
JFS cheats a bit (and this has to be kept in mind for some applications) with the 5 second commit window, but it seems to be a very effective cheating; indeed looking at IO traffic with blkstat blkparse shows that synchronous writes are a significant minority even with fsync, where the other filesystems almost only made synchronous writes with fsync. The write rate for the no-fsync case is very impressive for so many small files.

NILFS2 does pretty well with fsync comparatively, I guess because it interleaves data and metadata in log segments. But it also cheats in a resource sense: after the fsync copy the space allocated is much higher than the space used, at 136GiB for 70-71GiB of data:

soft#  dl /mnt/tmp/.
Filesystem     Type   1G-blocks  Used Available Use% Mounted on
/dev/sdd3      nilfs2      232G  136G       85G  62% /mnt/tmp
soft#  du -sm /mnt/tmp/.
72670   /mnt/tmp/.

Looking at its operation with vmstat during the fsync copy:

 0  0      0 272008 2232692 3826540    0    0  3128 23980 4422 3293  0  2 96  1  0
 1  0      0 270892 2233596 3827288    0    0     0 12448 1323 2041  0  1 99  0  0
 1  0      0 269900 2234488 3828084    0    0     0 12416 1398 2079  0  1 99  0  0
 0  0      0 267296 2235392 3829128    0    0  3492 13068 2385 3288  0  2 96  2  0
 0  0      0 265636 2236268 3829932    0    0     0 12072 1243 2003  0  1 98  0  0
 0  0      0 262968 2237172 3830992    0    0     0 12384 1297 2107  0  1 99  0  0
 0  0      0 261232 2238072 3831932    0    0     0 12192 1335 2078  0  1 99  0  0
 0  0      0 253848 2238976 3836124    0    0  3888 12396 3024 4276  0  2 96  2  0
 0  0      0 251816 2239872 3837088    0    0     0 12232 1334 2110  0  1 98  0  0
 0  0      0 248512 2240772 3838312    0    0     0 19060 3214 2853  1  2 97  0  0
 0  0      0 241596 2241680 3842384    0    0  3520 12136 3070 4348  0  2 96  2  0
 0  0      0 239300 2242580 3843444    0    0     0 12192 1326 2087  0  2 98  0  0
 0  0      0 238152 2243480 3844088    0    0     0 11944 1278 2030  0  1 99  0  0
 0  0      0 236104 2244384 3845212    0    0     0 12292 1419 2192  0  2 98  0  0
 0  0      0 229360 2245284 3849200    0    0  3476 11900 2828 3970  0  2 96  2  0

That is NILFS2 writes to the target filetree a lot more than what is being read from the source filetree. Most likely the reason is that on fsync NILFS2 writes to disk a full log segment even if it has not been filled up. Starting the NILFS2 compactor nilfs_cleanerd on the target filetree brings down the allocated space to be in line with the used space, and perhaps the time taken by the compactor (roughly an extra 30m) to do that should be added to the time to complete the copy.

170228 Tue: Some coarse speed tests with various Linux filesystems

Having previously mentioned my favourite filesystems I have decided to do again in a different for a rather informative, despite being simplistic and coarse, test of their speed similar to one I did a while ago, with some useful results. The test is:

Take a 640GB (Btrfs) filetree in a 973GB partition in the lower half of fast-ish 2TB 7200RPM disk drive, and a newly filetree formatted partition at the same place on a cheap-ish 2TB 5900RPM 2TB drive. The 640GB is mostly media files, photos including large panoramas (1MB to 200MB), movies (10MB to 4GB VOBs), GNU/Linux distribution ISOs (1GB to 4GB), and a collection of articles about GNU/Linux (many HTML and small image files). It is mostly largish files, but the collection of articles has many small files, and while handling it I can see IOPS going down dramatically.
First copy-by-tar from the source to the target partition, with this command:
```
time star -copy -p -xdot -C /mnt/media . /mnt/tmp
```
The command chosen is star because unlike most other tar variants it actually fsyncs the file, which to me is quite important to figure out a filesystem's proper profile. It also conveniently has much like pax or cpio a copy-mode.

After doing this, drop the caches, and read back the data just copied with:

sysctl vm/drop_caches=1; time star -f - -b 512 -c -p -xdot -C /mnt/tmp . | dd bs=512b of=/dev/zero

The system, as it should, is configured to run the flusher often and to flush often and vigorously with:
```
#  grep -H . /proc/sys/vm/dirty*{bytes,centisecs}*
/proc/sys/vm/dirty_background_bytes:200000000
/proc/sys/vm/dirty_bytes:900000000
/proc/sys/vm/dirty_expire_centisecs:1000
/proc/sys/vm/dirty_writeback_centisecs:100
```
It would have been better with a value of dirty_background_bytes and dirty_expire_centisecs half of those, but I wanted to give a slightly better chance to (misguided) techniques like delayed allocation.

It is coarse, it is simplistic, but it gives some useful upper bounds on how filesystem does in a fairly optimal case. In particular the write test, involving as it does a fair bit of synchronous writing and seeking for metadata, is fairly harsh; even if, as the test involves writing to a fresh, empty filetree, pretty much an ideal condition, it does not account at all for fragmentation on rewrites and updates. The results, commented below, sorted by fastest, in two tables for writing and reading:

type	write time	write rate	write sys CPU
JFS	148m 01s	72.4MB/s	24m 36s
F2FS	170m 28s	62.9MB/s	26m 34s
OCFS2	183m 52s	58.3MB/s	36m 00s
XFS	198m 06s	54.1MB/s	23m 28s
NILFS2	224m 36s	47.7MB/s	32m 04s
ZFSonLinux	225m 09s	47.6MB/s	18m 37s
UDF	228m 47s	46.9MB/s	24m 32s
ReiserFS	236m 34s	45.3MB/s	37m 14s
Btrfs	252m 42s	42.4MB/s	21m 42s

type	read time	read rate	read sys CPU
F2FS	106m 25s	100.7MB/s	66m 57s
Btrfs	108m 59s	98.4MB/s	71m 25s
OCFS2	113m 42s	94.3MB/s	66m 39s
UDF	116m 35s	92.0MB/s	66m 54s
XFS	117m 10s	91.5MB/s	66m 03s
JFS	120m 18s	89.1MB/s	66m 38s
ZFSonLinux	125m 01s	85.8MB/s	23m 11s
ReiserFS	125m 08s	85.7MB/s	69m 52s
NILFS2	128m 05s	83.7MB/s	69m 41s

Notes:
The system was otherwise quiescent.
I have watched the various tests with iostat, vmstat, and looking at graphs produced by collectd, displayed by kcollectd and sometimes I have used blkstat blktrace; I have also used occasionally used strace to look at the IO operations requested by star.
Having looked at actual behaviour, I am fairly sure that all involved filesystem respected fsync semantics.
The source disk seemed at all times to not be the limiting factor for the copy, in particular as streaming reads are rather faster, as shown above, than writes.

The first comment on the numbers above is the obscene amount of system CPU time taken, especially for reading. That the system CPU time taken for reading being 2.5 times or more higher than that for writing is also absurd. The test system has an 8 thread AMD 8270e CPU with a highly optimized microarchitecture, 8MiB of level 3 cache, and a 3.3GHz clock rate.

The the system CPU time for most filesystem types is roughly the same, again especially for reading, which indicates that there is common cause that is not filesystem specific. For F2FS the system CPU time for reading is more than 50% of the elapsed time, an extreme case. It is interesting to see that ZFSonLinux, which has uses its own cache implementation, ARC has a system CPU time of roughly 1/3 that of the others.

That Linux block IO involves an obscene amount of system CPU time to do IO I had noticed already over 10 years ago and that the issue has persisted so far is a continuing assessment of the Linux kernel developers in charge of developing the block IO subsystem.

Another comment that applies across all filesystems is that the range of speeds is not that different, all of them had fairly adequate, reasonable speeds given the device. While there is a range of better to worse, this is to be expected from a coarse test like this, and a different ranking will apply to different workloads. What this coarse test says is that none of these filesystems is is particularly bad on this, all of them are fairly good.

Another filesystem independent aspect is that the absolute values are much higher, at 6-7 times better, than those I reported only four years ago. My guess is is this mostly because the previous test involved the Linux source tree which contains a a large number of very small files; but also because the hardware was an old system that I was no longer using in 2012, indeed I had not used since 2006.

As to the selection of filesystem types tested, the presence of F2FS, OCFS2, UDF, NILFS2 may seem surprising, as they are considered special-case or odd filesystems. Even if F2FS was targeted at flash storage, OCFS2 at shared-device clusters, UDF at DVDs and BDs, and NILFS2 at "continuous snapshotting", they are actually all general purpose, POSIX-compatible filesystems that work well on disk drives and with general purpose workloads. I have also added ZFSonLinux, even if I don't like for various reasons, as a paragon. I have omitted a test of ext4 because I reckon that it is a largely pointless filesystem design, that exists only because of in-place upgradeability from ext3, which in turn was popular only because of in-place upgradeability from ext2, when the installed base of Linux was much smaller. Also OCFS2 has a design quite similar to that of the ext3 filesystem, and has some more interesting features.

Overall the winners, but not large margins, from this test seem to be F2FS, JFS, XFS, OCFS2. Some filesystem specific notes, in alphabetical order:

Btrfs was second best for reading speed and worst for writing speed. I suspect that the latter is because it was made with the default option of duplicated metadata blocks, to reduce the impact of checksum failure damage amplification. The slighly higher system-CPU time than most is probably due to checksum verification during reading.
F2FS was fastest at both writing and reading. Obviously its log-structured design helps on writing, and in general it seems good at keeping files contiguous. Probably a bit too contiguous, so I suspect it won't behave as well on rewrites and updates.
JFS behaved as usually pretty well at most everything. I suspect that part of its write rate is due to a small degree of cheating, as it does not commit updated metadata immediately, but only after five seconds, as looking at the collectd graph for JFS there isn't the typical large dip as in this other collectd graph for F2FS when writing the many small files part of the test archive.
Looking at the behaviour of JFS while reading shows also the opposite of cheating: when writing JFS tries to spread written data among allocation groups, which makes reading slower in this test than for a filesystem that packs tightly all data added to an empty filetree. The advantage of the JFS design is further additions to the filetree have a better chance of being nearer to existing data, improving long term locality.
NILFS2 behaved averagely well (BTW I mounted it with nogc). As reported before for some reason it is a bit slower than others at reading, I suspect because the metadata is interleaved with the data, and has to be skipped over during reading.
OCFS2 also behaved as usual pretty well, as a distant derivative of ext3 like ext4. Notably I run the test with metadata checksums enabled. Note that is a questionable choice because of previously noted checksum failure damage amplification. Its code size is among the largest.
ReiserFS behaved reasonably but was second slowest in both writing and reading. Presumably because it tends to work better on small files that the mostly large files in this test's workload.
UDF is usually associated with preparing images to burn to read-only images for it can be very well used for read-write hard disks, and performed pretty well in this test.
XFS performed averagely well, or even a bit better than average, and is another reasonable general purpose filesystem, but its implementation is huge and complex (at some point there was a report that the XFS source code contained five different B-tree implementations).
Some time ago it was reported as one of the slowest at creating and deleting files, because it strictly respected POSIX semantics to commit every metadata update. As a result it now arguably slightly cheats by buffering and then journaling metadata updates, therefore I have used options to limit that, as in logbufs=2,logbsize=16384 when mounting.
ZFSonLinux is a port and variant of OpenZFS to Linux, and OpenZFS is a variant of the now proprietary ZFS. Like OCFS2 and Btrfs it does checksumming, but does not allow single-volume replication to minimize the impact of checksum failure damage amplification. It performed indifferently, not particularly good at writing or reading. It had much lower system-CPU utilization as it uses its own buffer cache system, for which memory is reserved specifically. In this test I used a maximum 2GiB allocation for it.

170214 Mar: New Seagate disc drives and their declared duty cycles

Some time ago I mentioned an archival disk drive model from HGST which had a specific lifetime rating for total reads and writes, of 180TB per year as compared to the 550TB per year of a similar non-archival disk drive.

The recent series of IronWolf, IronWolf Pro and Enterprise Capacity disk drives from Seagate which are targeted largely at cold storage use have similar ratings:

IronWolf: 180TB/year (with MTBF of 1 million hours).
IronWolf Pro: 300TB/year (with MTBF of 1.2 million hours).
Enterprise Capacity: 550TB/year (with MTBF of 2 million hours).

It is interesting also that the lower capacity drives are rated for Nonrecoverable Read Errors per Bits Read, Max of 1 per 10E14 and the large ones for 1 in 1E15 (pretty much industry standard), because 180TB per year is roughly 10E13, and the drives probably are designed to last 5-10 years (they have 3-5 years warranties).

170205 Sun: A straightforward alternative to the setuid mechanism too

Having just illustrated a simple confinement mechanism for UNIX/Linux systems that uses the regular UNIX/Linux style permissions, I should add that the same mechanism can also replace in one simple unified mechanism the setuid protection domain switching of UNIX/Linux systems. The mechanism would be to add to each process, along with its effective id (user/group) what I would now call a preventive id with the following rules:

The access given to a program is that common to both the effective id and the preventive id (the intersection of the permissions for to the effective and preventive ids), which can be no access.
Both effective and preventive id are inherited on fork.
On exec the preventive id (user/group) of a process is set to the id of the executed file.
Files are created as in regular UNIX/Linux semantics with the effective id of the creating process.
A program in a process may set the preventive id to the same value as the effective id (or to any value if the preventive id is zero). This results in the current UNIX/Linux non-set-id semantics.
A program in a process may set the effective id to the same value as the preventive id (or to any value if the effective id is zero). This results in the the current UNIX/Linux set-id semantics.
If the effective id of a process and its preventive id are different, the process is confined to the set of resources accessible by both. Therefore a user that does not fully trust an executable can give access to just the resources it strictly needs to access, by setting permissions so that the id of the file containing the executable can access only those.

Note: there are some other details to take care of, like apposite rules for access to a process via a debugger. The logic of the mechanism is that it is safe to let a process operate under the preventive id of its executable, because the program logic of the executable is under the control of the owner of the executable, and that should not be subverted.

The mechanism above is not quite backwards compatible with the UNIX/Linux semantics because it makes changes in the effective or preventive ids depend on explicit process actions, but it can be revised to be backwards compatible with the following alternative rules:

Only if exec if for an executable file with the sticky bit set the preventive id of the process is set to the id of that executable file. The sticky bit in effect becomes the confinement bit.
If exec is for an executable file with the set-id (user/group) bit set, then the effective id of the process is set to the preventive id after this has been set to the id of the executable file.
This is probably not strictly necessary because almost all system-provided executables on a typical UNIX/Linux system are in files owned by id 0, so preventive ids would be 0 thus resulting in no confinement like in traditional UNIX/Linux semantics.

Note: the implementation of either variant of the mechanism is trivial, and in particular adding preventive id fields to a process does not require backward incompatible changes as process attributes are not persistent.

The overall logic is that in the UNIX/Linux semantics for a process to work across two protection domains it must play between the user and group ids; but it is simpler and more general to have the two protection domains identified directly by two separate ids for the running process.

170204 Sat: Aggregate cost of AWS servers

BusinessWeek has an interesting article on some businesses that offer tools to minimize cloud costs and a particularly interesting example they make contains these figures:

Proofpoint rents about 2,000 servers from Amazon Web Services (AWS), Amazon.com’s cloud arm, and paid more than $10 million in 2016, double its 2015 outlay. “Amazon Web Services was the largest ungoverned item on the company’s budget,” Sutton says, meaning no one had to approve the cloud expenses.

$10m for 2,000 virtual machines means $5,000 per year per VM, or $25,000 over the 5-year period where a physical server would be depreciated. That buys a very nice physical server and 5 years of colocation including power and cooling and remote hands, plus a good margin of saving, with none of the inefficiencies or limitations of VMs; actually if one buys 2,000 physical servers and colocation one can expect substial discounts over $25,000 for a single server over 5 years.

Note Those 2,000 servers are unlikely to have a large configuration, more likely to be small thin servers for purposes like running web front-ends.

The question is whether those who rent 2,000 VMs from AWS are mad. My impression is that they are more foolish than mad, and the key words in the story above are no one had to approve the cloud expenses. The point is not just lack of approval but that cloud VMs became the default option, the path of least resistance for every project inside the company.
The company probably started with something like 20 VMs to prototype their service to avoid investing in a fixed capital expense, and then since renting more VMs was easy and everybody did it, that grew by default to 2,000 with nobody really asking themselves for a long time whether a quick and easy option for starting with 20 systems is as sensible when having 2,000.

Cutting VM costs by 10-20% by improving capacity utilization is a start but fairly small compard to rolling their own.

I have written before that cloud storage is also very expensive, and cloud systems also seem to be. Cloud services seem to me premium products for those who love convenience more than price and/or have highly variable workloads, or those who need a builting content distribution network. Probably small startups are like that, but eventually they start growing slowly or at least predictably, and keep using cloud services by habit.

170202 Thu: A straightforward alternative to confinement mechanisms

There are two problems in access control, read-up and write-down, and two techniques (access lists and capabilities). The regular UNIX access control is aims to solve the read-up problem using an abbreviated form of access control lists, and POSIX added extended access control lists.

Preventing read-up with access control lists is a solution for preventing unintended access to resources by users, but does not prevent unintended access to resources by programs, or more precisely by the processes running those programs, because a process running with the user's id can access any resource belonging to that user, and potentially transfer the content of the resource to third parties such as the program's author, or someone who has managed to hack into the process running that program. That is, it does not prevent write-down.

The typical solution to confinement is use some form of container, that is to envelope a process running a program in some kind of isolation layer, that prevents it from accessing the resources belonging to the invoking user. The isolation layer can be usually:

A namespace restriction layer like LXC containers, (some aspects of them I mentioned in a recent article here) or QEMU virtual machines, which work by hiding resource a program's process should not access.
A special purpose mandatory access control list systems like AppArmor or SELinux, which work by listing explicitly which resources a process running a program can access, all others being disallowed.

There is a much simpler alternative (with a logic similar to distinct effective and real ids for inodes) that uses the regular UNIX/POSIX permissions and ACLs: to create a UNIX/POSIX user (and group) id per program, and then to allow access only if both the process owner's id and the program's id have access to a resource.

This is in effect what SELinux does in a convoluted way, and is fairly similar also to AppArmor profile files, which however suffer from the limitation of imposing policies to be shared by all users.

Instead allowing processes to be characterized by both the user id of the process owner, and the user id of the program (and similarly for group ids), would allow users to make use of regular permissions and ACL to tailor access by programs to their own resources, if they so wished.

Note: currently access is granted if the effective user id or the effective group id of a process owner have permission to access a resource. This would change to granting access if [the process effective user id and the program user id both have permission] or [the process effective group id and the program group id both have permission]. Plus some additional rules, for example that a program id of 0 has access to everything, and can only be set by the user with user id 0.

Note: of course multiple programs could share the same program user and group id, which perhaps should be really called foreign id or origin id.

170129 Sun: The mainframe development problem and MPI

At some point in order to boost the cost and lifestyles of their executive most IT technology companies try to move to higher margin market segments, which usually are those with the higher priced products. In the case of mainframes this meant an abandonment of lower priced market segments to minicomputer suppliers. This created a serious skills problem: to become a mainframe system administrator or programmer a mainframe was needed for learning, but mainframe hardware and operating systems were available only in large configurations at very high price levels, and therefore used only for production.

IBM was particularly affected by this as they really did not want to introduce minicomputers with a mainframe compatible hardware and operating system, to make sure customers locked-in to them would not be tempted to fall back on a minicomputers, and the IBM lines of minicomputers were kept rigorously incompatible with and much more primitive than the IBM mainframe line. Their solution, which did not quite succeed, was to introduce PC-sized workstations with a compatible instruction set, to run on them a version of the mainframe operating system, and even to create plug-in cards for the IBM PC line, all to make sure that the learning systems could not be used as cheap production systems.

Note: The IBM 5100 PC-sized mainframe emulator became an interesting detail in the story of time traveller John Titor.

The problem is more general: in order to learn to configure and program a system, one has to buy that system or a compatible one. Such a problem is currently less visible because most small or large systems are based on the same hardware architectures, the intel IA32 or AMD64 ones, and one of two operating system, MS-Windows or Linux, and a laptop or a desktop thus have the same runtime environment as a larger system.

Currently the problem happens in particular for large clusters for scientific computing, and it manifests particularly for highly parallel MPI programs. In particular many users of large clusters develop their programs on their laptops and desktops, and these programs read data from local files using POSIX primitives, rather than using MPI2 IO primitives. Thus the demand for highly parallel POSIX-like filesystems like Lustre that however are not quite suitable for highly parallel situations.

The dominant issue of such programs is that the issues that arise with them, mostly synchronization and latency impacting speed, cannot be reproduced on a workstation, even if it can run the program with MPI or other frameworks. Many of these programs cannot even be tested on a small cluster, because their issues arise only at grand scale, and may be even different on different clusters, as they may be specific to the performance envelope of the target.

The same problems happen with OpenMP, which is designed for shared memory systems with many CPUs: while even laptops today have some CPUs that share memory, the real issues happens with systems have have a few dozen CPUs and non-uniform shared memory, and with systems that have several dozen or even hundreds of CPUs, and the issue they have also arise only at scale, and vary depending on the details of the implementation.

It is not a simple problem to solve, and is a problem that limits severely the usefulness of highly parallel programs on large clusters, as it limits considerably their ecosystem.

170128 Sat: An interesting blog post on namespaces

The bottom of this site's index page has a list of sites and blogs similar to this in containing opinions about computer technology, mostly related to Linux and system and network engineering and programming, and I have been recently discovered a blog by the engineers and programmers offering their services via one of the main project based contract work sites.

The blog like every blog has a bit of a promotional role for the site and the contract workers it lists, but the technical content is not itself promotional, but has fairly reasonable and interesting contributions.

I have been particularly interested by a posting on using Linux based namespaces to achieve program and process confinement.

It is an interesting topic in part because it is less than wonderfully documented, and it can have surprising consequences.

The posting is a bit optimistic in arguing that using namespacing, it is possible to safely execute arbitrary or unknown programs on your server as there are documented cases of (moderately easy) programs breeaking out of containers and even virtual machines. But namespace do make it more difficult to do bad things and often raising the level of cost and difficulty achieves good-enough security.

Note: The difficulty with namespaces and isolation is that namespaces are quite complicated mechanisms that need changes to a lot of Linux code, and they are a somewhat forced retrofit into the logic of a POSIX-style system, while dependable security mechanism need to be very simple to describe and code. Virtual machine systems are however even more complicated and error prone.

The posting discusses process is, network interface, and mount namespaces, giving simple illustrative examples of code to use them, and briefly discusses also user id, IPC and host-name namespaces. Perhaps user id namespaces would deserve a longer discussion.

The posting indeed can serve as a useful starting point for someone who is interested in knowing more a complicated topic. It would be nice to see it complemented by another article on the history and rationale of the design of namespaces and related ideas, and maybe I'll write something related to that in this blog.

170124 Tue: A legitimate Unsolicited Commercial Email!

With great surprise I have received recently for the first time in a very, very long time a legitimate and thus non-spam unsolicited commercial email.

The reason why unsolicited commercial emails are usually considered spam is that they are as a rule mindless time wasting advertising, usually automated and impersonal, and come in large volume as a result. For the email I received it was unsolicited and commercial, but it was actually a reasonable business contact email specifically directed at me from an actual person who answered my reply.

Of course there are socially challenged people who regard any unsolicited attempt at contact as a violation, (especially from the tax office I guess :-)), but unsolicited contact is actually pretty reasonable if done in small doses and for non-trivial personal or business reasons.

It is remarkable how rare they are and that's why I use my extremely effective anti-spam wildcard domain scheme.

170120 Fri: What is the "Internet"?

While chatting the question was raised of what is the Internet. From a technical point of view that is actually an interesting question with a fairly definite answer:

There are two Internets, one is an IPv4 internet and the other is an IPv6 internet, and they are entirely distinct and incompatible, even if usually they use the same physical cabling infrastructure, and have very similar designs, and nearly all IPv6 Internet nodes are also on the IPv4 Internet.
Both Internets, of all possible internets, are those built with the conventions (from service port numbers to the list of DNS root name server addresses) defined by the IANA.
Of all possible internets that could be built using the IANA conventions, the two Internets are those where one can connect directly to official USA government sites.

There are IPv4 or IPv6 internets that are distinct from the two Internets, but usually they adopt the IANA conventions and have some kind of gateway (usually it needs to do NAT) to the two Internets.

Note: these internets use the same IPv4 or IPv6 address ranges as the two Internets, but use them for different hosts. Conceivably they could also use the same port numbers for different services: as port 80 has been assigned by IANA for HTTP service, a separate internet could use port 399. But while this is possible I have never heard of an internet that uses assigned numbers different from the IANA ones, except for the root DNS servers.

But it is more common to have IPv4 or IPv6 internets that differ from the two Internets only in having a different set of DNS root name server addresses, but are otherwise part, at the transport and lower levels, of the two Internets.

There is a specific technical term to indicate the consequences of having different sets of DNS root name servers, naming zone. Usually the naming zones for internets directly connected to the two Internets overlap and extend with those of the two Internets, they just add (and sometimes redefine or hide) the domains of the Internets.

Note: Both the IPv4 and the IPv6 Internets share the same naming zone, in the sense that the IPv4 DNS root servers and the IPv6 DNS root server serve the same root zone content by convention. This is not necessarily the case at deeper DNS hierarchy levels: it is a local convention whether a domain resolves to an IPv4 and IPv6 address that are equivalent as in being on the same interface and the same service daemons being bound to them.

170107 Sat: F2FS and Bcachefs

Two relatively new filesystem designs and implementations for Linux are F2FS and Bcachefs.

The latter is a personal project of the author of Bcache, a design to cache data from slow storage onto faster storage. It seems very promising, and it is one of the few with full checksumming, but it is not part yet of the default Linux sources, and work on it seems to be interrupted, even if the implementation of the main features seems finished and stable.

F2FS was initially targeted at flash storage devices, but is generally usable as a regular POSIX filesystem, and performs well as such. Its implementation is also among the smallest with around half or less the code size of XFS, Btrfs, OCFS2 or ext4:

   text    data     bss     dec     hex filename
 237952   32874     168  270994   42292 f2fs/f2fs.ko

Many congratulations to its main author Jaegeuk Kim, a random Korean engineer in the middle of huge corporation Samsung, for his work.

Since the work has been an official Samsung project, and F2FS is part of the default Linux sources, and is widely used on Android based cellphone and tablet devices, it is likely to be well tested and to have long term support.