Computing notes 2017 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

[file this blog page at: digg del.icio.us Technorati]

170610 Sat: How challenging is a goal of 20MB/s per TB of storage, and latency

I have previously mentioned the issue of IOPS-per-TB and the related requirement from CERN for disk storage systems to deliver at least 20MB/s of parallel read-write per TB (where at least means worst-case and parallel means interleaved, at least on a single disk) and I was recently astonished to hear someone comment that is pretty easily satisfied given that hard disks have a transfer rate of 150MB/s, when I mentioned that disk larger than 1TB were not likely to achieve that, especially 4-10TB archival disks.

Then I noticed that person was using a MacBook Pro with a very high transfer rate flash SSD with very high IOPS rates and I suspect that detached him from a keen awareness of the performance envelope of rotating disk drives.

Part of the performance envelope of typical contemporary disk drives is that in purely sequential transfers they can do up to 150MB/s on the outer cylinders and up to 80MB/s on the inner cylinders (flash SSDs don't have this difference), and that average random positioning times are around 12ms (flash SSDs have small fractions of a millisecond at least fom reading).

Simultaneous read-write rates of at least 20MB/s (not per TB) require at least one random positioning per transfer, and at 80MB/s it takes around 12ms to transfer 1MB; therefore reading-writing sequentially two parallel streams with 1MB blocks, which are huge, gives around 25ms per block, for a total of 40MB/s, and it takes a read/write transfer size of at least 320KB to achieve at least 20MB/s, in pretty much optimal conditions, and let's say 0.5MB transfers to be have some margin for variability.

It then takes 3MB transfers to deliver 60MB/s parallel read-write suitable for a a 3TB disk, and disks of 4TB cannot deliver 80MB/s except in purely single stream stream mode, that is in practice no disk larger than 3TB can deliver 20MB/s parallel read-write.

Note: That is a 1TB disk capable of 80 random IOPS needs 0.5MB transfers to delivers 20MB/s of parallel read-writes. Put another way, with 64KB transfers a storage unit needs at least 400 random IOPS per TB to deliver 20MB/s of parallel read-writes per TB, and no single disk drive can do that. With 16KB transfers at least 1,700 random IOPS are needed.

Not many applications can afford to read or write with transfers as large as 0.5MB and 3MB. And it is not just the application that must use those transfer sizes: the data on disk must be contiguous for at least the length of the transfer size.

Large transfer sizes also have a very big downside: starting latency for every transfer is driven by the size of the previous transfer, and completion latency by that of the current transfer. A 0.5MB transfer take 18ms to complete, and on a busy 1TB disk that means each transfer not only has a completion latency of 18ms, but also starts at least 18ms after being issued. An IO latency of 18ms is pretty huge. On a 3TB disk with 3MB blocks completion latency goes up to 50ms.

The cause of large latency is that random positioning has a latency of 12ms, and to amortize the cost of that one has to choose large IO sizes, and those add to the latency as it takes even more time to transfer larger blocks. It is possible to mitigate the latency to completion of the current transfer by delivering the large transfers incrementally as a sequence of smaller blocks, but the latency to starting the next transfer still is the full size of the previous transfer.

Also transfer sizes of 0.5MB per operation on a 1TB disk are quite challenging also because on a busy system many IO operations, in particular metadata and logging operations, use much smaller transfer sizes, and in particular opening a file requires usually at least one random short transfer.

It depends on workload and in particular on average IO operation size and average contiguity on disk, but sustaining 20MB parallel read-write per TB of capacity, that requires contiguous transfers of at least 0.5MB, seem to me quite a challenge even on 1TB drives.

Note: the above contains some simple arithmetic, but some familiarity with watching actual disk based storage systems should give a very solid intuitive understanding that 20MB/s parallel worst-case read-write per TB is a pretty demanding goal. I suspect that so many people now have desktops and laptops wioth flash SSDs that intuitive understanding is harder to develop.

As previously noted the IOPS-per-TB issue is per TB of data actually present, rather than of available capacity. An approach that seems astute to some people is to show-off the IO rate and low cost of a storage system using large cheap-per-TB disk drives when it is still far from full, for example when each drive has less than 1TB of data on it, even if its capacity is 4-10TB, because that 1TB of actual data:

In effect in the beginning 4-10TB drives are behaving as if they were particularly fast 1TB drives. But when they fill up their observed average speed will fall to 1/4 or less than that of a when they had only 1TB of data. I have seen that happen quite a few times on storage systems setup by astute people.

170601 Thu Copyright violation in the UK, employment terms

Recent news are that copyright violations of any extent are now treated like theft in the UK, having become a criminal instead of civil matter:

it's now possible for copyright holders to pursue criminal cases for an infringer of any size

This reminded me of being asked some time ago to comment on some employment contract terms that included vesting all copyright in any works created by the employee during that contract to the employer. My opinion as a non-lawyer is that was intended to mean that every single email, photo, letter, blog posts, made by that individual belonged solely to their employers. This included of course every personal one, including family photos, private letters, public blog posts, etc.; my opinion as a non-lawyer is that whether that was enforceable was more likely than not, but also that it would be extremely expensive for an individual to contest that in court. If it were enforceable two consequences might happen in my opinion:

Now in the past such consequences in my opinion would be fairly small, except for the cost of a court case, because if the matter was civil the employer would have to prove and quantify damages, and I reckon that any claim to damages for giving an employee's relatives a photo of say their family would be laughed out of court. But I reckon that the employee could still be subject to summary dismissal for contract violation.

But my impression is that if copyright violation of any type is a criminal offense, like stealing a sandwich from a shop is theft regardless of the amount involved, then an employer making employer-unautorized copies of a private email she has written by sending that email to her parents and siblings might well be punished with a jail sentence.

If that were a consequence, employers would have not just have a way to terminate employment for breach on contract any employee who sent a personal email, photo, letter, published a blog post without authorization, but also to prosecute and perhaps to jail them. That would be a very powerful way to put pressure on employees. Any claim that such clauses would never be used in that way is ridiculous: lawyers are as a rule required to make full use of any legal advantage they can find, and they are not shy.

As to that, the same employement terms contained fairly extreme non-compete clauses, even for a low-end back-office clerical job, not an executive or research job. My impression as a non-lawyer is that in the UK they are not enforceable; but if the employer were determined to bring suit, it would be very expensive to defend against it. In some jurisdictions even extreme non-compete clauses are enforceable.

Update 170611: Indeed I have recently read an article reporting a case where they were enforced, probably to as an example to others:

How Noncompete Clauses Keep Workers Locked In

Restrictions once limited to executives are now spreading across the labor landscape — making it tougher for Americans to get a raise.

170520 Sat: Hard links: not for GNU/Linux GUI users?

As I was trying out some GUI file managers (notably KDE's Dolphin and Konqueror) to rename and move files inside a directory in which I had let too many file name entries accumulate I noticed that the option to hard link a file was missing. That is oddd as hard linking is a pretty fundamental UNIX operation and very useful to have different names refer to the same file, so I did a web search and discovered a relevant bug report and this comment in it:

I must say that I'm strongly against such a feature. The entry "Clone" (or whatever it's called) will be useful for only a few, but confuse ~99% of all users.

Now those who like to create hard links will say "Just add an option for it", but if we add options for every little thing that a few people consider useful (please believe me, there are *lots* of such things),

The comment reveals that the microsoftization of the Linux ecosystem continues, and how deep it has become.

Note: an irony here is that they allow symbolic linking, but even if symbolic links resemble the shortcuts of MS-Windows they are quite different in semantics, risking more confusion.

Sort of like expected the dired module in EMACS has hard-linking, and Midnight Commander has it too. Fortunately. But I liked the ability to rename files inline. I will use writable dired mode instead.

170519 Mon: Stuck laptop CPU fan and temperatures

After looking again at the graph showing that the CPU temperature of my laptop being strictly dependent on CPU load I checked my laptop's collectd graphs and that was confirmed. Which was highly suspicious as at some point the fan should start operating and cool the CPU.

So I did a minimal amount of cooling conduit cleanup: first I hit the relevant area with a rubber mallet to shake it a bit (dust tends to clump) and then I blew hard through it, and used a hoover to aspire it out. The result is quite amazing, ands it is that now without much exception the CPU temperature is fixed pretty much at 50C except during periods of very high load:

The difference is large, and I suspect that for some reason my laptop's CPU must have been stuck. Looking at the collectd graph that must have happened more than a year ago. Also I have cleaned the inflow dust filter grilles of my excellent NZXT H230 desktop case and that by itself has got the temperature of the hard disks in it to drop by on average 3C (from an already rather low level):

170514 Mon: The NSA "hoard" and WannaCry

The recent story of the leaking of the NSA hoard of tools exploiting MS-Windows bugs and the resulting use of that for a ransom-ware project is fairly amusing as well as some comments on it.

My impression is that it is not such a big deal: I would expect every major black-hat group to have a long library of exploits, even packaged as kits some of them resulting from accidental bugs in various software parts, and some of them well designed deliberate bugs created by the people the black-hat teams quite likely have infiltrated in major software companies and groups. The NSA hoard is probably one of several they have, and likely one of the least valuable.

Note: it is not a surprise that someone at Microsoft complained that the NSA had not reported the bug but most likely several other black-hat groups knew of the same bug and nobody reported it either. Because the job of black-hat groups is to use the bugs, not get them fixed, and accidental bugs (if that was an accidental bug) are particularly convenient to keep in a list because the deliberate bugs are too precious to use except in difficult cases.

Therefore one particular exploit kit is not that important in the black-hat world. Also because basic exploit kits relying on accidental bugs are in general not that important, because they are meant to be used towards generic targets: they are in general of value only statistically, that is against generic masses of target systems, a certain percentage of which will have some exploitable bug. Because of their genericity they are useful for gaining access from outside to inside systems, that is from the Internet to some private system.

Therefore as in the current situation their usefulness is for implanting on the target systems ransom-ware or spam-bot modules, rather than steal information or money directly. Also they tend to be a bit too obvious, so involving a high risk of discovery.

What is more worrying for people in my line of work that involves protecting server systems, are the tools involved in targeted black-hat activities, against specific sites or people. These tools are invariably based on some form of access from the inside to the outside, from a private system that has been compromised to the general internet. The difficulty of course is how to compromise a private system, and that is why both the relevant techniques are used for targeted access and also in general can or should be used for targeted access.

Note: some over-entusiastic person kept asking me for pointers to exploit kits for accessing bank computers and taking money from them. That is a really stupid idea because outside-to-inside exploit kits are the wrong approach, and bank computers that handle money flows are a harder target than most even for targeted access tools.
That's why black-hat organizations use exploit kits for generic personal computers hoping for statistical effectiveness (even a small percentage of exploited systems can be profitable) as to opportunities for installation of ransom-ware or spam-ware. It is rather bank executives who misdirect large flows of money quite legally by doing dodgy deals and become rich from the related bonuses. Paraphrasing what someone wise said, robbing a bank is much harder than embezzling from it.

All these techniques are exact replicas of traditional spy techniques (nihil novum sub soli) based on planting something and someone inside, and the way to achieve that is to have someone or something outside that gets taken inside.

Note: for some examples of a discussion of what can be done to add deliberate bugs as to hardware design there is a nice presentation by Bunnie Huang, Impedance Matching Expectations Between RISC-V and the Open Hardware Community given on 2017-05-10 at 11:00am at the 6th RISC-V workshop.

Edward Snowden has published a catalogue of techniques and hardware devices that the NSA had available several years ago, typically based on communication devices that are difficult to see and detect and are planted into the target systems. These are planted into the system before the system is delivered inside, or by someone who takes them from outside to inside. Among those planted outside before the system is taken inside are the software bugs planted by agents of a black-hat organization at the source, while working for a hardware or software company.

These techniques and hardware devices rely on a large budget and protection from a state entity to be effective, but that's at available to the NSA and many other black-hat groups, both government sponsored or independent. Given that I think that defending against a targeted operation by a group at that level is practically impossible for almost any private individual or commercial organization, and that it is already extremely difficult to detect such operations.

In practice the best choice is to avoid being the target of such a high-budget operation. Since these operations are risky and expensive, they are only justified against particularly valuable targets. Not being a valuable target is in general a very desirable situation because typically security budgets in profit (in a wide sense) oriented organizations are massively undersized, even at valuable targets, as they are costs to be incurred without the promise of an immediate profit, just the possibility of minimising the costs of an attack maybe a long while later. Sometimes I commiserate the system administrators in organizations that handle money. I prefer to stick to research data (which is usually worthless without the accompanying metadata, such as the details of the experimental conditions), or public/publishable data like web content, for example advertising content.

170513 Sat: Preventing Linux partition scanning

Some user on IRC reported a very inconvenient situation: for some reason a disk to be accessed from Linux had a malformed partition table that crashed the kernel code that attempted to parse it. It would have been very easy to clear the relevant area of the disk, but that could not be done because by default Linux attempts to read the partition table as soon as a disk is connected.

This was obviously one of the many cases of violation of sensible practices in Linux design due to an excessive desire for automagic, and I could relate to it as I had to maintain system which used unpartitioned disks and printed on every boot a lot of spurious error messages about malformed partition tables (without crashing at least).

It is easy to verify that in the Linux kernel the rescan_partitions function is invoked unconditionally by the __blkdev_get function, and therefore it is impossible to avoid an attempt to find and parse a partition table. Ten years ago a simple patch was proposed to avoid scanning for a partition table at boot time, but it was not adopted.

So I thought about a workaround that ought to work: to use the boot parameter that tells the kernel that a specific block device has a given partition table using the boot kernel parameter blkdevparts. Since the code in rescan_partitions stops looking for a volume label when the first match is found this ought to work. Even if in the check_part table, which is organized alphabetically, the cmdline_partition section follows several checks for some variants of the Acorn volume label type, which should not be mostly a problem.

The overall issue is that because of the ever deepening microsoftization of GNU/Linux culture improper designs are made to be as automagic as possible: in this case the improperty is that the kernel autodiscovers all possible devices, and then autoactivates them, including partition scanning, and then upcalls the appallingly misdesigned udev subsystem to ensure devices are mounted or started as soon as possible.

The proper UNIX-style logic would be for the default to be inverted, that is for the kernel to initiate discovery of devices only on user request, and then to activate discovered devices again only on user request, and to mount or start them again only on user request. With the ability for the system startup scripts to assume those user requests on boot, but with the ability for the user to override the default. But then the Linux kernel designers have even forgotten to specify an abstract device state machine, so it does not seem surprising that an improper activation logic is used.

170506 Sat: Web pages and laptop heat

So I have previously remarked that many web pages keep the CPU busy and that laptops and handhelds are limited by heat and now I am too a double victim: in the past couple of weeks as I have been looking at news and forum sites like The Guardian my laptop (which I have been using as my main system) has shut down abruptly a few times because of overheating, caused entirely by CPU load from free-running Javascript in web pages.

This a graph (using psensor) of the effect of closing a JavaScript tab containing a Disqus forum threads on CPU usage and temperature:

This is in part due to having a somewhat older CPU chip that does not throttle its clock when it gets hotter, but that in a sense would be cheating: a nominally fast CPU that almost transparently becomes slow when it runs at its rated speed.

There is no workaround: like many I use browser extensions that disable JavaScript code (for example NoScript) but several sites insist on dynamic content generation or update via JavaScript, but once a site is given free access to a browser's JavaScript virtual machine they can take as much advantage as possible. Like many I think that executable content is a very bad idea for several reasons, but it is here to stay for a while. Let's hope that enough handheld users scald their hands or laps or suffer from extreme slowness because of executable content sites that they become less popular.

170427 Thu: NFS, GNOME and KDE issues with auto-mounting

I have long used auto-mounters like am-utils (once known as AMD) and recently the Linux-native autofs5 as a more UNIX-style alternative to complex and insecure schemes based on MS-Windows like udev-based mounting on attachment of a network device, as mounting on attempted access and is both safer and results on shorter mounting periods. Because of this I have written that script for autofs5 that turns /etc/fstab into an auto-mount map.

Since recently I upgrade my NFS server configuration I also decided to look at, being somewhat related, my auto-mounting situation, which has had two slightly annoying problems:

The motivation in both cases is that I reckon it is a good practice to leave filetrees umounted as much as possible, and mount them only for the period where they are needed, because:

My investigation about the NFS server Ganesha and auto-unmounting shows that Ganesha opens an exported filetree's top directory on startup, and this means it won't be mounted only if accessed by clients:

#  lsof -a -c ganesha.nfsd /fs/*/.
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ganesha.n 27646 root    5r   DIR   0,43      918  256 /mnt/export1/.
ganesha.n 27646 root    6r   DIR   0,47       84  256 /mnt/export2/.
ganesha.n 27646 root    7r   DIR    8,8    16384    5 /mnt/export3/.
ganesha.n 27646 root    8r   DIR   8,38     4096    5 /mnt/export5/.
ganesha.n 27646 root    9r   DIR   0,51      106  256 /mnt/export6/.
ganesha.n 27646 root   10r   DIR   0,55      156  256 /mnt/export7/.

The problem with both GNOME and KDE is that their open-file menus and their file managers are not designed to be UNIX-like, where there is a single tree of directories with user-invisible mount points, but they are designed to resemble MS-Windows (and MacOS X) in having a main filetree and a list of volumes with independent and separate filetrees. The other systems do this because on those systems by default when a mountable volume is attached to the system it is mounted statically until it is detached, and they don't have a UNIX-like mount-on-access behaviour.

Because of this they open the system table of mountpoints and read a list of those mountpoints and access them periodically to get their status, and this of course triggers automounting and constant automounmting. This has caused a number of complaints (1, 2, 3, 4), and fortunately these have resulted in some ready-made solutions for GNOME, and similar ones for KDE:

GNOME solution

The problem is that GNOME uses a filesystem layer and daemons called GVFS that by default mounts all mount-points listed in system configuration files. This can be disabled in three ways:

  • Per user with:
    gsettings set org.gnome.desktop.media-handling \
      automount false
    gsettings set org.gnome.desktop.media-handling \
      automount-open false
  • Per mount-point by adding the relevant /etc/fstab line the flag x-gvfs-hide.
  • For the whole system change the default by creating a file under /etc/dconf/db/local.d/ containing:
    [org/gnome/desktop/media-handling]
    automount=false
    automount-open=false
    and then running dconf update.
KDE solution

KDE used the kio system, but that is not quite involved, instead there is an automounter daemon per-user, and some common plugins mount by default any available mount-points. To disable all these for a user:

  • In the KDE Runner settings ensure that the Devices plugin is disabled.
  • In the KDE Dolphin Places panel mark as hidden all mountpoints.
  • Run kcmshell4 device_automounter_kcm and enable Only automatically mount removable media that has been manually mounted before and disable the other two options, and Forget Device those listed under Device Overrides (not really sure about this).
  • Run kcmshell4 kcmkded and disable Removable devices automounter.

There is a patch to disable mount-point auto-mounting entirely (but only for those of type autofs) but is is not in the KDE version I am using.

170424 Mon: Btrfs and NFS service and NFS daemon Ganesha

So I am using Btrfs, despite it having some design and implementation flaws in its more advanced functionality, because it is fairly reliable as a single device filesystem with checksums and snapshots.

Because of its somewhat unusual redirect-on-write (a variant of COW) design and its many features it has somewhat surprising corner cases, and one of them is that direct IO writes does not update the block checksums (and direct IO reading does not verify them).

Note: this does not mean that checksums are missing when block is written by direct IO: as the above mailing list post says, they are computed when the write is initiated, but the block can still be modified by the application after that, and the already-computed checksum is not updated, because the application has no way to indicate, but initiating a second write, that the block has been updated. That is direct IO does not support stable writes which is problem in any filesystem, but with Btrfs can result in the block and checksum not matching.
If the block is modified before the write actually starts, the block written to persistent storage will be actually correct and consistent, but the checksum will be wrong; if the block is modified while the write is in progress, it is possible for the write to contain part original block and part updated block, and the block will be consistent and the checksum wrong. Note that in the case of block in chunks with profiles that have redundancy more complicated cases can arise: for example for a block with a raid1 profile the block checksum can be computed, and the block can be written from memory before it is modified to one copy, and after it is modified to the other.
On reads the checksums are indeed checked, but no error is returned to the application if they fail to verify, only a warning is printed to the system log. This is because it is possible for the block to be correct and the checksum wrong.

Update 170813: An earlier report that non-stable writes can be an issue with nfs-kernel-server was unfortunately optimistic because of an unclear error report. When nfs-kernel-server reads from Btrfs volumes the block checksums are read and checked but not verified, and in case of check failure a warning is printed by no error is returned. That means that potentially NFS clients get corrupted block without any error indication, despite the presence of checksums, as reported in this IRC conversation:

2017-02-20 14:32:28 <multicore> darkling: with direct-io csums errors don't result in I/O errors
2017-02-20 14:33:01 <multicore> so do just a nfs share from a btrfs and feed gargage to your clients :/
...
2017-02-20 14:51:55 <Walex> Do direct IO *writes* update checksums?
2017-02-20 14:52:12 <Ke> yes
...
2017-02-20 14:52:28 <Walex> Ke: Ahh so only read don't check them.
2017-02-20 14:53:19 <Ke> Walex: not sure, but the problem is that pages that are modeified during the write will not get updated checksums accordingly
2017-02-20 14:53:53 <multicore> Walex: during DIO reads there's a csum error logged but it doesn't result in EIO
2017-02-20 14:54:25 <Walex> Ke: so direct IO writes don't actually update checksums. My question was not about EIO, it was to update the "Gotchas" section
2017-02-20 14:54:57 <Walex> Is it fair to say that Direct IO reads and writes simply *ignore* checksums?
2017-02-20 14:55:03 <Ke> they do update them to match whatever they read when they read it
2017-02-20 14:55:32 <Ke> when writing that is
2017-02-20 14:55:47 <Ke> the page is not write protected during DIO
2017-02-20 14:55:57 <Ke> it can be modified at any time
...
2017-02-20 14:56:23 <Walex> Ke: does direct IO read verify checksums?
2017-02-20 14:56:28 <Ke> no idea
...
2017-02-20 14:56:46 <multicore> Walex: yes
2017-02-20 14:57:40 <Walex> multicore: so direct IO read and write do the checksum thing, but a block written with direct IO may be modified after the checksum has been computed.
2017-02-20 14:57:52 <Walex> modified in memory I mean.
2017-02-20 14:58:25 <Walex> so the issue is "stable" writes I guess.
2017-02-20 14:58:46 <multicore> Walex: i don't know
2017-02-20 14:58:47 <Walex> I am asking actually, not stating this.
2017-02-20 14:59:53 <Walex> multicore: because if the NFS kernel server does direct IO to Btrfs files, then what happens to checksums during direct IO is a rather big issue.
2017-02-20 15:05:20 <multicore> Walex: i was saved by logcheck because it mailed me about the csum error, had to delete the file from nfs server + client and recover from backups
...
2017-02-20 15:08:23 <multicore> Walex: i was told by someone that with raid1 + csum error the error is fixed but i haven't verified it...
2017-02-20 15:08:34 <multicore> Walex: ...with DIO...
...
2017-02-20 15:11:03 <multicore> this is the reason why i was asking: "<multicore> is there a patch around that turns DIO csum errors to EIO instead just logging the error?"

Note: it is not clear why the nfs-kernel-server checks but does not verify checksums. Perhaps when it writes there is a stable writes issue similar to that arising from direct IO. Also, in both the direct IO and nfs-kernel-server read cases, if a checksum fails on read, and there is enough redundancy to reconstruct valid block, it will be reconstructed. But may fail in the worst cases, where the block is valid but the checksum is wrong, or where there are two copies of the block, but the write was not stable on only one of them.

The obvious solution for NFS access to Btrfs volumes is to use the NFS daemon server Ganesha which runs like the SMB/CIFS daemon server Samba as a user process using ordinary IO, like many server daemons for other local and distributed filesystems. A kernel based server is going to have lower overheads, but for a file server the biggest costs are in storage and in network transfers, and running in user mode is comparatively a very modest cost.

Ganesha is something that I kept looking at, and years ago it was not quite finished and awkward to install and poorly documented. It is currently instead fairly polished, it is a standard package for Ubuntu, and newer versions are easily obtained via a fairly well maintained Ubuntu PPA.

It is still not very well documented, and in particular the existing example configurations are overly simplistic, as well as with some default paths not quite right for my Ubuntu 14 system, so here is my own example:

# /usr/share/doc/nfs-ganesha-doc/config_samples/config.txt.gz
# /usr/share/doc/nfs-ganesha-doc/config_samples/export.txt.gz
# https://github.com/nfs-ganesha/nfs-ganesha/wiki/Ganesha-config

NFS_CORE_PARAM
{
  # https://github.com/nfs-ganesha/nfs-ganesha/wiki/CoreParams

  Nb_Worker		=32;

  # Compare the port numbers with those in 
  # /etc/default/nfs-kernel-server (MNT_Port)
  # /etc/default/quota (Rquota_port)
  # /etc/modprobe.d/local.conf (NLM_Port)
  #   options lockd nlm_udpport=893 nlm_tcpport=893
  # /etc/sysctl.conf (NLM_Port)
  #   fs.nfs.nlm_tcpport=893
  #   fs.nfs.nlm_udpport=893

  NFS_Protocols		=4;
  NFS_Port		=2049;
  NFS_Program		=100003;
  MNT_Port		=892;
  MNT_Program		=100005;

  Enable_NLM		=true;
  NLM_Port		=893;
  NLM_Program		=100021;

  Enable_RQUOTA		=true;
  Rquota_Port		=894;
  Rquota_Program	=100011;

  Enable_TCP_keepalive	=false;
  Bind_addr		=0.0.0.0;
}

NFSV4
{
  DomainName		="WHATEVER";
  IdmapConf		="/etc/idmapd.conf";
}

NFS_KRB5
{
  Active_krb5		=true;
  # This could be "host" to recycle the host principal for NFS.
  PrincipalName		="nfs";
  KeytabPath		="/etc/krb5.keytab";
  CCacheDir		="/run/ganesha.nfs.kr5bcc";
}

EXPORT_DEFAULTS
{
  Protocols		=4;
  Transports		=TCP;

  Access_Type		=NONE;
  SecType		=krb5p;
  Squash		=Root_Squash;
}

EXPORT
{
  Access_Type		=RW;
  SecType		=krb5p;
  Squash		=Root_Squash;

  Export_ID		=1;
  FSAL			{ Name="XFS"; }
  # Must not have symlinks
  Path			="/var/data/pub";
  Pseudo		="/var/data/pub";

  PrefRead		=262144;
  PrefWrite		=262144;
  PrefReadDir		=262144;
}

Two of the interesting aspects of Ganesha is that it can serve not just NFSv3 and NFSv4, but also NFSv4.1, pNFS and even the 9p protocol and has specialized backends called FSAL for various type of underlying filesystems, notably CephFS and GlusterFS but also various other cluster filesystems, plus XFS and ZFS. The backends for the cluster filesystems give direct access to those filesystems without the need of a local client for them, which reduces considerably complications and overheads.

My experience with Ganesha has been so far very positive, the speed is good as expected, and it is rather easier to setup and configure than nfs-kernel-server in particular as to Kerberos authentication; also as to configuration, administration, monitoring and investigation, being a user-level daemon.

170417 Mon: Remote two factor authentication and passwords and passphrases

Remote two-factor authentication requires knowledge or possession of two authenticators, usually one is known and the other is possessed. The obvious example is a password and a mobile phone number that received a once-only validation code, or a password and a random one-time-password from a pre-generated list.

However in many cases there is a weaker form of two-factor authentication, where the authenticator unlocks some container holding the effective authenticator, because in that case access requires both knowledge of the authenticator and possession of the container. This case is common where the authenticator is the passphrase of some encrypted file containing the real authenticator, for example an SSH key passphrase and a the SSH private key file it encrypts.

But there is a more common case that should not be forgotten: possession of the computer itself, that is access can only be local and physical. This does not mean that the computer is only accessible locally, but that the authenticator only works locally.

Now these cases seem fairly obvious, but there is an important case where the distinction matters: when providing an authenticator based on knowledge and it can be observed by third parties, such as when a key-logger is present on the system, or a camera or microphone records the typing of the authenticator. In these cases the authenticator can be recorded. For organizations with resources putting a tiny key-logger in the target system or a tiny camera or a tiny microphone in the room containing can be pretty easy for regular offices and residences, but there is a far more common case: use of a system, such as a laptop, in public, such as on a train, bus, or in a coffee house or internet cafè, because surveillance cameras are extremely common.

Only explicit or implicit two-factor authentication helps against that. If that is a concern all remote accesses must involve two-factor authentication, and all one-factor accesses must be strictly local.

So for example on a UNIX/Linux systems passwd authentication should only be allowed for local log-in (getty etc.) and only key based authentication should be allowed for remote log-in (SSH etc.). Also, the local login password can be further augmented, in case loss of possession of the system is a concern, with an OTP system.

Note: Indeed for quite critical systems a site I know uses password plus OTP for local logins, and manually verified SSH keys for remote access.

In particular where is a concern with observation of passwords, Kerberos (and this includes MS-AD) passwords should not be allowed for remote log-in, because the password can be used then to get remote access entirely by itself; also kinit by the same argument should not be used with just a password.

Fortunately some provision has been made for two-factor authentication using Kerberos and smartcards/certificates (PKINIT) and tentative approaches with OTP (FAST, PAKE).

There is an alternative: to extract from Kerberos a suitable keytab and put it into an encrypted file. Then knowing the passphrase for the encrypted keytab is needed to use it with kinit, and possession of that passphrase and the keytab are both necessary. This option has some downsides, and should be carefully considered.

In theory all passwords and passphrases should be unique, but for primary login (whether local or remote) passwords that is quite tedious. The local login password and the passphrases used to encrypt local copies of the SSH keys (and of the Kerberos keytab) can be the same, as they all require possession of the system or encrypted file; the OTP and Kerberos passwords however need to be different, or else those who figure out the login passwords can use them to achieve local access when possession of the system has been obtained, or remote access via Kerberos.

Note: if local or remote login is achieved by an adversary, all encryption passphrases can be trivially obtained on first use, by installing a key-logger daemon.

170407 Thu: Unusual filesystem properties

I have recently learned that in Btrfs the number of links per directory (as returned by the stat(2) system call) is always 1, and this number has the special meaning that it is not known how many subdirectories a directory has, as a directory always has at least 2 sudirectories. Also the number of inodes maximum and free in a Btrfs filetree is always reported as 0 (as returned by the statfs(2) system call), which has the special meaning that the actual number is not known, while a filetree has always at least 1 inode for the local root directory.

These are significant changes in the semantics of UNIX-style filesystems, and other filesystem types have adopted them. It makes certainly easier to adopt filesystem implementations in which hard-links and directories are implement differently from the tradition in which a directory is a simple list of hard links which consist of link name and inode number.

The relevant code within the popular findutils collection and some other has been updated, but there is a lack of documentation of the special meaning of a link count of 1 for directories or 0 for maximum inodes and free inodes.

They are fairly significant changes, because they are obviously not backwards compatible: other filesystems with fluid statistics use the largest number, which is backwards compatible, becaue both number of links to a directory and number of free or user inodes are used as upper limits in loops that also rely on conditions.

170324 Fri: IPv6 reaches 14% of Google traffic, SixXS will close down

Three years ago IPv6 traffic was seen by Google as hitting 2% of total IP traffic and it is quite remarkable that today it hit 14.22%, and that is 14% of a rather bigger total. It is also remarkable that:

At 14% of total traffic IPv6 is now commercially relevant, in the sense that services need to be provided over IPv6, as IPv6 users cannot be ignored, and accordingly even mass market ISPs like Sky Broadband UK have been giving IPv6 connectivity as standard for several months and it is therefore sad but understandable that transition services like SixXS are being closed down:

Summary

SixXS will be sunset in H1 2017. All services will be turned down on 2017-06-06, after which the SixXS project will be retired. Users will no longer be able to use their IPv6 tunnels or subnets after this date, and are required to obtain IPv6 connectivity elsewhere, primarily with their Internet service provider.

My current ISP does not yet support IPv6, but it has a low delay to a fairly good 6to4 gateway, so I continue to use 6to4 with NAT.

170302 Thu: Some coarse speed tests with Btrfs etc. and small files

In the previous note about a simple speed test of several Linux filesystems star was used because unlike GNU tar it does fsync(2) before close(2) of a written file, and I added that for Btrfs I also ensured that metadata was (as per default) duplicated as per the dup profile, and these were challenging details.

To demonstrate how challenging I have done some further coarse tests on a similar system, copying from a mostly root filesystem (which has many small files) to a Btrfs filesystem, with and without the star option -no-fsync and with Btrfs metadata single and dup in order of increasing write rate with fsync:

Btrfs
target
write
time
write
rate
write
sys CPU
dup fsync 381m 28s 3.3MB/s 13m 14s
single fsync 293m 26s 4.2MB/s 13m 15s
dup no-fsync 20m 09s 61.7MB/s 3m 50s
single no-fsync 19m 58s 62.3MB/s 3m 41s

For comparison the same source and target and copy command with some other filesystems also in order of increasing write rate with fsync:

other
target
write
time
write
rate
write
sys CPU
XFS fsync 318m 49s 3.9MB/s 7m 55s
XFS no-fsync 21m 24s 58.2MB/s 4m 10s
F2FS fsync 239m 04s 5.2MB/s 9m 09s
F2FS no-fsync 21m 32s 57.8MB/s 5m 02s
NILFS2 fsync 118m 27s 10.5MB/s 7m 52s
NILFS2 no-fsync 23m 17s 53.5MB/s 5m 03s
JFS fsync 21m 45s 57.2MB/s 5m 11s
JFS no-fsync 19m 42s 63.2MB/s 4m 24s

Notes:
Source filetree was XFS on a very fast flash SSD, so not a bottleneck.
Source filetree 70-71GiB (74-75GB) and 0.94M inodes (0.10M of which directories), of which 0.48M under 1KiB, 0.69M under 4KiB and 0.8M under 8KiB (the Btrfs allocation block size).
Observed occasionally IO with blktrace and blkparse and with fsync virtually all IO was synchronous, as fsync on a set of files of 1 block is essentially the same as fsync on every block.
Because of its (nearly) copy-on-write nature Btrfs handles transactions with multiple fsyncs particularly well.

Obviously fsync per-file on most files is very expensive, and dup file metadata is also quite expensive, but nowhere as much as fsync. A very strong demonstration that the performance envelope of storage system is rather anisotropic. It is also ghastly interesting that given the same volume of data IO with fsync costs more than 3 times the system CPU time as without. Some filesystem specific notes:

170228 Tue: Some coarse speed tests with various Linux filesystems

Having previously mentioned my favourite filesystems I have decided to do again in a different for a rather informative, despite being simplistic and coarse, test of their speed similar to one I did a while ago, with some useful results. The test is:

It is coarse, it is simplistic, but it gives some useful upper bounds on how filesystem does in a fairly optimal case. In particular the write test, involving as it does a fair bit of synchronous writing and seeking for metadata, is fairly harsh; even if, as the test involves writing to a fresh, empty filetree, pretty much an ideal condition, it does not account at all for fragmentation on rewrites and updates. The results, commented below, sorted by fastest, in two tables for writing and reading:

type write
time
write
rate
write
sys CPU
JFS 148m 01s 72.4MB/s 24m 36s
F2FS 170m 28s 62.9MB/s 26m 34s
OCFS2 183m 52s 58.3MB/s 36m 00s
XFS 198m 06s 54.1MB/s 23m 28s
NILFS2 224m 36s 47.7MB/s 32m 04s
ZFSonLinux 225m 09s 47.6MB/s 18m 37s
UDF 228m 47s 46.9MB/s 24m 32s
ReiserFS 236m 34s 45.3MB/s 37m 14s
Btrfs 252m 42s 42.4MB/s 21m 42s
type read
time
read
rate
read
sys CPU
F2FS 106m 25s 100.7MB/s 66m 57s
Btrfs 108m 59s 98.4MB/s 71m 25s
OCFS2 113m 42s 94.3MB/s 66m 39s
UDF 116m 35s 92.0MB/s 66m 54s
XFS 117m 10s 91.5MB/s 66m 03s
JFS 120m 18s 89.1MB/s 66m 38s
ZFSonLinux 125m 01s 85.8MB/s 23m 11s
ReiserFS 125m 08s 85.7MB/s 69m 52s
NILFS2 128m 05s 83.7MB/s 69m 41s

Notes:
The system was otherwise quiescent.
I have watched the various tests with iostat, vmstat, and looking at graphs produced by collectd, displayed by kcollectd and sometimes I have used blkstat blktrace; I have also used occasionally used strace to look at the IO operations requested by star.
Having looked at actual behaviour, I am fairly sure that all involved filesystem respected fsync semantics.
The source disk seemed at all times to not be the limiting factor for the copy, in particular as streaming reads are rather faster, as shown above, than writes.

The first comment on the numbers above is the obscene amount of system CPU time taken, especially for reading. That the system CPU time taken for reading being 2.5 times or more higher than that for writing is also absurd. The test system has an 8 thread AMD 8270e CPU with a highly optimized microarchitecture, 8MiB of level 3 cache, and a 3.3GHz clock rate.

The the system CPU time for most filesystem types is roughly the same, again especially for reading, which indicates that there is common cause that is not filesystem specific. For F2FS the system CPU time for reading is more than 50% of the elapsed time, an extreme case. It is interesting to see that ZFSonLinux, which has uses its own cache implementation, ARC has a system CPU time of roughly 1/3 that of the others.

That Linux block IO involves an obscene amount of system CPU time to do IO I had noticed already over 10 years ago and that the issue has persisted so far is a continuing assessment of the Linux kernel developers in charge of developing the block IO subsystem.

Another comment that applies across all filesystems is that the range of speeds is not that different, all of them had fairly adequate, reasonable speeds given the device. While there is a range of better to worse, this is to be expected from a coarse test like this, and a different ranking will apply to different workloads. What this coarse test says is that none of these filesystems is is particularly bad on this, all of them are fairly good.

Another filesystem independent aspect is that the absolute values are much higher, at 6-7 times better, than those I reported only four years ago. My guess is is this mostly because the previous test involved the Linux source tree which contains a a large number of very small files; but also because the hardware was an old system that I was no longer using in 2012, indeed I had not used since 2006.

As to the selection of filesystem types tested, the presence of F2FS, OCFS2, UDF, NILFS2 may seem surprising, as they are considered special-case or odd filesystems. Even if F2FS was targeted at flash storage, OCFS2 at shared-device clusters, UDF at DVDs and BDs, and NILFS2 at "continuous snapshotting", they are actually all general purpose, POSIX-compatible filesystems that work well on disk drives and with general purpose workloads. I have also added ZFSonLinux, even if I don't like for various reasons, as a a paragon. I have omitted a test of ext4 because I reckon that it is a largely pointless filesystem design, that exists only because of in-place upgradeability from ext3, which in turn was popular only because of in-place upgradeability from ext2, when the installed base of Linux was much smaller. Also OCFS2 has a design quite similar to that of the ext3 filesystem, and has some more interesting features.

Overall the winners, but not large margins, from this test seem to be F2FS, JFS, XFS, OCFS2. Some filesystem specific notes, in alphabetical order:

170214 Mar: New Seagate disc drives and their declared duty cycles

Some time ago I mentioned an archival disk drive model from HGST which had a specific lifetime rating for total reads and writes, of 180TB per year as compared to the 550TB per year of a similar non-archival disk drive.

The recent series of IronWolf, IronWolf Pro and Enterprise Capacity disk drives from Seagate which are targeted largely at cold storage use have similar ratings:

It is interesting also that the lower capacity drives are rated for Nonrecoverable Read Errors per Bits Read, Max of 1 per 10E14 and the large ones for 1 in 1E15 (pretty much industry standard), because 180TB per year is roughly 10E13, and the drives probably are designed to last 5-10 years (they have 3-5 years warranties).

170205 Sun: A straightforward alternative to the setuid mechanism too

Having just illustrated a simple confinement mechanism for UNIX/Linux systems that uses the regular UNIX/Linux style permissions, I should add that the same mechanism can also replace in one simple unified mechanism the setuid protection domain switching of UNIX/Linux systems. The mechanism would be to add to each process, along with its effective id (user/group) what I would now call a preventive id with the following rules:

Note: there are some other details to take care of, like apposite rules for access to a process via a debugger. The logic of the mechanism is that it is safe to let a process operate under the preventive id of its executable, because the program logic of the executable is under the control of the owner of the executable, and that should not be subverted.

The mechanism above is not quite backwards compatible with the UNIX/Linux semantics because it makes changes in the effective or preventive ids depend on explicit process actions, but it can be revised to be backwards compatible with the following alternative rules:

Note: the implementation of either variant of the mechanism is trivial, and in particular adding preventive id fields to a process does not require backward incompatible changes as process attributes are not persistent.

The overall logic is that in the UNIX/Linux semantics for a process to work across two protection domains it must play between the user and group ids; but it is simpler and more general to have the two protection domains identified directly by two separate ids for the running process.

170204 Sat: Aggregate cost of AWS servers

BusinessWeek has an interesting article on some businesses that offer tools to minimize cloud costs and a particularly interesting example they make contains these figures:

Proofpoint rents about 2,000 servers from Amazon Web Services (AWS), Amazon.com’s cloud arm, and paid more than $10 million in 2016, double its 2015 outlay. “Amazon Web Services was the largest ungoverned item on the company’s budget,” Sutton says, meaning no one had to approve the cloud expenses.

$10m for 2,000 virtual machines means $5,000 per year per VM, or $25,000 over the 5-year period where a physical server would be depreciated. That buys a very nice physical server and 5 years of colocation including power and cooling and remote hands, plus a good margin of saving, with none of the inefficiencies or limitations of VMs; actually if one buys 2,000 physical servers and colocation one can expect substial discounts over $25,000 for a single server over 5 years.

Note Those 2,000 servers are unlikely to have a large configuration, more likely to be small thin servers for purposes like running web front-ends.

The question is whether those who rent 2,000 VMs from AWS are mad. My impression is that they are more foolish than mad, and the key words in the story above are no one had to approve the cloud expenses. The point is not just lack of approval but that cloud VMs became the default option, the path of least resistance for every project inside the company.
The company probably started with something like 20 VMs to prototype their service to avoid investing in a fixed capital expense, and then since renting more VMs was easy and everybody did it, that grew by default to 2,000 with nobody really asking themselves for a long time whether a quick and easy option for starting with 20 systems is as sensible when having 2,000.

Cutting VM costs by 10-20% by improving capacity utilization is a start but fairly small compard to rolling their own.

I have written before that cloud storage is also very expensive, and cloud systems also seem to be. Cloud services seem to me premium products for those who love convenience more than price and/or have highly variable workloads, or those who need a builting content distribution network. Probably small startups are like that, but eventually they start growing slowly or at least predictably, and keep using cloud services by habit.

170202 Thu: A straightforward alternative to confinement mechanisms

There are two problems in access control, read-up and write-down, and two techniques (access lists and capabilities). The regular UNIX access control is aims to solve the read-up problem using an abbreviated form of access control lists, and POSIX added extended access control lists.

Preventing read-up with access control lists is a solution for preventing unintended access to resources by users, but does not prevent unintended access to resources by programs, or more precisely by the processes running those programs, because a process running with the user's id can access any resource belonging to that user, and potentially transfer the content of the resource to third parties such as the program's author, or someone who has managed to hack into the process running that program. That is, it does not prevent write-down.

The typical solution to confinement is use some form of container, that is to envelope a process running a program in some kind of isolation layer, that prevents it from accessing the resources belonging to the invoking user. The isolation layer can be usually:

There is a much simpler alternative (with a logic similar to distinct effective and real ids for inodes) that uses the regular UNIX/POSIX permissions and ACLs: to create a UNIX/POSIX user (and group) id per program, and then to allow access only if both the process owner's id and the program's id have access to a resource.

This is in effect what SELinux does in a convoluted way, and is fairly similar also to AppArmor profile files, which however suffer from the limitation of imposing policies to be shared by all users.

Instead allowing processes to be characterized by both the user id of the process owner, and the user id of the program (and similarly for group ids), would allow users to make use of regular permissions and ACL to tailor access by programs to their own resources, if they so wished.

Note: currently access is granted if the effective user id or the effective group id of a process owner have permission to access a resource. This would change to granting access if [the process effective user id and the program user id both have permission] or [the process effective group id and the program group id both have permission]. Plus some additional rules, for example that a program id of 0 has access to everything, and can only be set by the user with user id 0.

Note: of course multiple programs could share the same program user and group id, which perhaps should be really called foreign id or origin id.

170129 Sun: The mainframe development problem and MPI

At some point in order to boost the cost and lifestyles of their executive most IT technology companies try to move to higher margin market segments, which usually are those with the higher priced products. In the case of mainframes this meant an abandonment of lower priced market segments to minicomputer suppliers. This created a serious skills problem: to become a mainframe system administrator or programmer a mainframe was needed for learning, but mainframe hardware and operating systems were available only in large configurations at very high price levels, and therefore used only for production.

IBM was particularly affected by this as they really did not want to introduce minicomputers with a mainframe compatible hardware and operating system, to make sure customers locked-in to them would not be tempted to fall back on a minicomputers, and the IBM lines of minicomputers were kept rigorously incompatible with and much more primitive than the IBM mainframe line. Their solution, which did not quite succeed, was to introduce PC-sized workstations with a compatible instruction set, to run on them a version of the mainframe operating system, and even to create plug-in cards for the IBM PC line, all to make sure that the learning systems could not be used as cheap production systems.

Note: The IBM 5100 PC-sized mainframe emulator became an interesting detail in the story of time traveller John Titor.

The problem is more general: in order to learn to configure and program a system, one has to buy that system or a compatible one. Such a problem is currently less visible because most small or large systems are based on the same hardware architectures, the intel IA32 or AMD64 ones, and one of two operating system, MS-Windows or Linux, and a laptop or a desktop thus have the same runtime environment as a larger system.

Currently the problem happens in particular for large clusters for scientific computing, and it manifests particularly for highly parallel MPI programs. In particular many users of large clusters develop their programs on their laptops and desktops, and these programs read data from local files using POSIX primitives, rather than using MPI2 IO primitives. Thus the demand for highly parallel POSIX-like filesystems like Lustre that however are not quite suitable for highly parallel situations.

The dominant issue of such programs is that the issues that arise with them, mostly synchronization and latency impacting speed, cannot be reproduced on a workstation, even if it can run the program with MPI or other frameworks. Many of these programs cannot even be tested on a small cluster, because their issues arise only at grand scale, and may be even different on different clusters, as they may be specific to the performance envelope of the target.

The same problems happen with OpenMP, which is designed for shared memory systems with many CPUs: while even laptops today have some CPUs that share memory, the real issues happens with systems have have a few dozen CPUs and non-uniform shared memory, and with systems that have several dozen or even hundreds of CPUs, and the issue they have also arise only at scale, and vary depending on the details of the implementation.

It is not a simple problem to solve, and is a problem that limits severely the usefulness of highly parallel programs on large clusters, as it limits considerably their ecosystem.

170128 Sat: An interesting blog post on namespaces

The bottom of this site's index page has a list of sites and blogs similar to this in containing opinions about computer technology, mostly related to Linux and system and network engineering and programming, and I have been recently discovered a blog by the engineers and programmers offering their services via one of the main project based contract work sites.

The blog like every blog has a bit of a promotional role for the site and the contract workers it lists, but the technical content is not itself promotional, but has fairly reasonable and interesting contributions.

I have been particularly interested by a posting on using Linux based namespaces to achieve program and process confinement.

It is an interesting topic in part because it is less than wonderfully documented, and it can have surprising consequences.

The posting is a bit optimistic in arguing that using namespacing, it is possible to safely execute arbitrary or unknown programs on your server as there are documented cases of (moderately easy) programs breeaking out of containers and even virtual machines. But namespace do make it more difficult to do bad things and often raising the level of cost and difficulty achieves good-enough security.

Note: The difficulty with namespaces and isolation is that namespaces are quite complicated mechanisms that need changes to a lot of Linux code, and they are a somewhat forced retrofit into the logic of a POSIX-style system, while dependable security mechanism need to be very simple to describe and code. Virtual machine systems are however even more complicated and error prone.

The posting discusses process is, network interface, and mount namespaces, giving simple illustrative examples of code to use them, and briefly discusses also user id, IPC and host-name namespaces. Perhaps user id namespaces would deserve a longer discussion.

The posting indeed can serve as a useful starting point for someone who is interested in knowing more a complicated topic. It would be nice to see it complemented by another article on the history and rationale of the design of namespaces and related ideas, and maybe I'll write something related to that in this blog.

170124 Tue: A legitimate Unsolicited Commercial Email!

With great surprise I have received recently for the first time in a very, very long time a legitimate and thus non-spam unsolicited commercial email.

The reason why unsolicited commercial emails are usually considered spam is that they are as a rule mindless time wasting advertising, usually automated and impersonal, and come in large volume as a result. For the email I received it was unsolicited and commercial, but it was actually a reasonable business contact email specifically directed at me from an actual person who answered my reply.

Of course there are socially challenged people who regard any unsolicited attempt at contact as a violation, (especially from the tax office I guess :-)), but unsolicited contact is actually pretty reasonable if done in small doses and for non-trivial personal or business reasons.

It is remarkable how rare they are and that's why I use my extremely effective anti-spam wildcard domain scheme.

170120 Fri: What is the "Internet"?

While chatting the question was raised of what is the Internet. From a technical point of view that is actually an interesting question with a fairly definite answer:

There are IPv4 or IPv6 internets that are distinct from the two Internets, but usually they adopt the IANA conventions and have some kind of gateway (usually it needs to do NAT) to the two Internets.

Note: these internets use the same IPv4 or IPv6 address ranges as the two Internets, but use them for different hosts. Conceivably they could also use the same port numbers for different services: as port 80 has been assigned by IANA for HTTP service, a separate internet could use port 399. But while this is possible I have never heard of an internet that uses assigned numbers different from the IANA ones, except for the root DNS servers.

But it is more common to have IPv4 or IPv6 internets that differ from the two Internets only in having a different set of DNS root name server addresses, but are otherwise part, at the transport and lower levels, of the two Internets.

There is a specific technical term to indicate the consequences of having different sets of DNS root name servers, naming zone. Usually the naming zones for internets directly connected to the two Internets overlap and extend with those of the two Internets, they just add (and sometimes redefine or hide) the domains of the Internets.

Note: Both the IPv4 and the IPv6 Internets share the same naming zone, in the sense that the IPv4 DNS root servers and the IPv6 DNS root server serve the same root zone content by convention. This is not necessarily the case at deeper DNS hierarchy levels: it is a local convention whether a domain resolves to an IPv4 and IPv6 address that are equivalent as in being on the same interface and the same service daemons being bound to them.

170107 Sat: F2FS and Bcachefs

Two relatively new filesystem designs and implementations for Linux are F2FS and Bcachefs.

The latter is a personal project of the author of Bcache, a design to cache data from slow storage onto faster storage. It seems very promising, and it is one of the few with full checksumming, but it is not part yet of the default Linux sources, and work on it seems to be interrupted, even if the implementation of the main features seems finished and stable.

F2FS was initially targeted at flash storage devices, but is generally usable as a regular POSIX filesystem, and performs well as such. Its implementation is also among the smallest with around half or less the code size of XFS, Btrfs, OCFS2 or ext4:

   text    data     bss     dec     hex filename
 237952   32874     168  270994   42292 f2fs/f2fs.ko

Many congratulations to its main author Jaegeuk Kim, a random Korean engineer in the middle of huge corporation Samsung, for his work.

Since the work has been an official Samsung project, and F2FS is part of the default Linux sources, and is widely used on Android based cellphone and tablet devices, it is likely to be well tested and to have long term support.