Computing notes 2019 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

2019 December

2019 November

191109 Sat: Cloud VMs with long term leasing compared to colocation

AWS announced today a long term leasing offer that considerably reduces the hourly cost of their virtual machines. With the new pricing, committing to something like an m4.4xlarge instance (16 cores, 64GiB RAM, 2Gb/s storage connection, no storage) would cost $0.3008/h instead of $0.80/h with an Instance Family Savings Plan paid upfront for 3 years, that is just under $8,000 plus taxes, or about $16,000 plus taxes for 6 years if the server is on an extended depreciation period. A really nice server like that costs less than $2,000 plus taxes to buy outright. AWS is in effect charging over $2,000 per year plus taxes to host it and give an availability guarantee, when a colocation service may charge around $1,000 a year (without guaranteeing availability of the server itself) for a server that cost $2,000 plus taxes to buy.
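The arithmetic above can be checked with a short sketch. The rates and prices are the ones quoted in this entry; the function name and the 365-day year are my own simplifications, and taxes are left out.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap years, instance running all the time

def lease_cost(rate_per_hour, years):
    """Total cost in dollars of running one instance continuously for `years`."""
    return rate_per_hour * HOURS_PER_YEAR * years

on_demand_3y = lease_cost(0.80, 3)       # on-demand rate quoted above
savings_plan_3y = lease_cost(0.3008, 3)  # 3-year upfront Savings Plan rate quoted above
colocated_3y = 2000 + 1000 * 3           # purchase price plus 3 years of colocation fees

print(f"on demand:    ${on_demand_3y:,.0f}")
print(f"savings plan: ${savings_plan_3y:,.0f}")
print(f"colocated:    ${colocated_3y:,.0f}")
```

Under these assumptions the three-year totals come to roughly $21,000 on demand, $7,900 with the Savings Plan, and $5,000 for an owned and colocated server.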

What AWS have done with their new pricing option is to strip away most of the value of the "flexibility" argument for AWS, because buying up front 3 years of virtual machine time is pretty inflexible. They offer flexibility in that the prepaid time applies to instances in a family, but for any customer with a small or large fleet of systems that is pretty much irrelevant.

Therefore the only rationales for using a service like AWS seem to me to be:

Extremely variable workload, not just a significant base workload plus peaks.
Making available all data on systems to state security services.
Ensuring that the computing plant of a business is physically inaccessible to employees to make lock-outs extremely easy.
Large tax or other financial incentives to turn capital investment into a running cost, or to ensure that the business has no physical assets.

2019 October

2019 September

190908 Sun A big issue with NFS Ganesha and the Ubuntu LTS 16 and 18 kernels

I have been using NFS Ganesha for a while, and recently I discovered a significant issue with it and contemporary Ubuntu kernels (4.4.0 for Ubuntu 16 and 4.15.0 for Ubuntu 18): it somehow formats responses to the Linux kernel NFSv4 client in a way that, even if correct, triggers a bug in the NFS client, so that the client ignores some entries in the responses to the READDIR and READDIRPLUS operations.

This for example means that rm -rf DIR usually does not remove all files under DIR and DIR itself, but gives an error claiming that the directory is not empty: that's because rm first lists all the entries in that directory, and since some entries are ignored, they are not deleted.

This seems to be an issue with those Ubuntu kernels, as it does not happen under Fedora 30 that uses a 5.2 series kernel, under Ubuntu LTS 18 with a 5.0 kernel, and does not happen with the NFS nfs-kernel-server implementation, which I think is not as flexible and maintainable as NFS Ganesha.

Note: the client side bug seems to have been fixed in 4.17, and depends also on the value of rsize: the larger the less often it happens. A particular example with a directory with 90 entries:

$ uname -a; for N in 1024 2048 4096 6144 8192 12288 16384 32768 65536; do echo -n "$N => "; sudo mount -t nfs -o rw,vers=4,proto=tcp,timeo=10,intr,rsize=$N azara:/scratch /mnt/tmp && ls /mnt/tmp/test | wc -l; sudo umount /mnt/tmp; done
Linux noether 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
1024 => 90
2048 => 81
4096 => 86
6144 => 86
8192 => 88
12288 => 88
16384 => 90
32768 => 90
65536 => 90
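A check like the one above can be scripted when the expected names are known in advance, for example because the test directory was populated on the server side. This is a minimal sketch: the function name and the file names are hypothetical, and it only compares what the client lists with what ought to be there.

```python
import os
import tempfile

def missing_entries(directory, expected_names):
    """Return the expected names that a directory listing fails to report."""
    listed = set(os.listdir(directory))
    return sorted(set(expected_names) - listed)

# Demo on a local temporary directory, where nothing should be missing;
# on an affected client one would point it at the NFS mount instead.
with tempfile.TemporaryDirectory() as tmp:
    expected = [f"f{i:02d}" for i in range(90)]
    for name in expected:
        open(os.path.join(tmp, name), "w").close()
    print(missing_entries(tmp, expected))
```

On a local filesystem this prints an empty list; on a client showing the problem described here I would expect it to name the entries that the listing dropped.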

190904 Wed NFS v3, v4 and types of identity mapping

In UNIX systems users have both a name and a number, and it is the number that is used for authorization even if it is the name that is used for authentication (several user accounts can share a user number but have different passwords). Each UNIX system instance has a user name space and a user number space that are incommensurable with those of other UNIX system instances.

Note: here user means either user or group, as they are handled similarly in UNIX systems, and user number means uid and group number means gid.

This means that a user name or number can have completely different meanings on different UNIX instances, for example correspond to different people.

This poses a problem both for transferring filetrees on media between UNIX system instances, and sharing them over a network: the only sensible action by default is to pretend that the foreign filetree is owned by a special user name and user number that is presumed meaningless.

However it is possible, at a site with a central system administration, to assign the same meanings to the same user names, or the same user numbers, or to different ones, by convention.

In that case NFS v3 and v4 support those conventions, although the way in which they are supported is somewhat peculiar and not entirely well documented. For NFS v3 it is quite simple:

The protocol allows only user numbers to be transmitted, and most implementations do not allow mapping a user number from one system to a number on another system. Therefore user numbers must have the same meaning everywhere, and if that's the case it is easiest to share the same account list everywhere (typically the /etc/passwd file), and thus the same user names too.

For NFS v4 there are four different options:

By default only usernames are transmitted, and they are tagged with a domain which is just a string, with the convention that usernames tagged with the same string are in the same namespace, that is commensurable, while those tagged with different domains are incommensurable.
Given this each site can decide on a domain string and share accounts across all NFSv4 servers and clients under that domain, with each UNIX instance potentially having different user numbers for the same username in the same domain.
Note: the domain here is purely a string of characters without inherent meaning, even if it may be the same as the site's DNS name or the name of its Kerberos domain. If Kerberos is used to authenticate NFS users it is entirely possible but pointless to have a completely different NFS naming domain and Kerberos domain name.
Usernames may be transmitted without being tagged by a domain, and then they are presumed to be all incommensurable to any username local to an instance.
User numbers, represented as strings, are transmitted and then they are all presumed to be commensurable to the local user numbers, just like in NFS v3.
Both tagged usernames and user numbers may be transmitted and then tagged usernames have priority unless the domain tag is different from the local domain tag, in which case the user number is considered commensurable.

In all these options commonly available implementations don't allow any specific mapping, for example from remote tagged username V@B to local tagged username U@A.
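The options listed above can be loosely modelled as a single decision function. This is only a sketch of the rules as described in this entry, not the code of any real implementation: the function name, the `numeric_ok` switch, the example domain and the use of 65534 as the meaningless "nobody" number are all assumptions made for illustration.

```python
NOBODY_UID = 65534  # assumption: a conventional number for the "meaningless" user

def map_owner(owner, local_domain, local_users, numeric_ok=False):
    """Map an owner string received over NFS v4 to a local user number."""
    if owner.isdecimal():
        # A user number sent as a string: trusted as-is only if the site allows it.
        return int(owner) if numeric_ok else NOBODY_UID
    name, sep, domain = owner.partition("@")
    if sep and domain == local_domain:
        # Tagged with our own domain: same namespace, look the name up locally.
        return local_users.get(name, NOBODY_UID)
    # Untagged, or tagged with a foreign domain: incommensurable.
    return NOBODY_UID

users = {"alice": 1000}  # hypothetical local account list
print(map_owner("alice@example.org", "example.org", users))
print(map_owner("alice@elsewhere.net", "example.org", users))
print(map_owner("1000", "example.org", users, numeric_ok=True))
```

With `numeric_ok` left off the function behaves like the default option; turning it on also accepts user numbers, which roughly corresponds to the last two options in the list.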

In practice the only useful options are to share across a site all usernames tagged with the same domain, or to share across a site all user numbers, with a central repository of account names or numbers, and usually that makes it easy to have the same user names and user numbers everywhere.

Note: in order to support use of both NFS v3 and v4 it might seem more important to share the same meaning for user numbers, but if Kerberos is used (as it should be) to authenticate NFS users, since it uses only names, names should also be the same. Many NFS v4 Kerberos subsystems allow arbitrary mappings between Kerberos user names and UNIX user names, but that ideally should be avoided.

190901 Sun NFS user/group name or id mapping options

The Linux NFS kernel client uses as the default mapping method the nfsidmap executable, the NFS kernel server uses the rpc.idmapd daemon, and both use the nfsidmap library; the NFS Ganesha server uses that library directly.

There are some interesting non-obvious details about the nfsidmap executable and library:

The library supports four mappings, for user and group name to user and group number, and vice versa.
The library can use three different methods to do a mapping: via a static list, via NSS (the name service switch, the same mechanism behind getpwnam() and related calls), or via an LDAP server. However the static method is only available for mappings to and from Kerberos names (this is undocumented).
The kernel components cannot use the nfsidmap library directly, and cache the mappings in a kernel keyring, and if a mapping is uncached they use an upcall to the nfsidmap executable (linked to the library) passing it the item (name or number) to map and the keyring identifier to which to add the mapping.
By rewriting nfsidmap it is possible to create arbitrary mappings, but the default version only allows a straight map or squashing.
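The cache-then-upcall arrangement described above can be pictured with a toy model. It is not how the kernel keyring or the nfsidmap executable are actually implemented; the class, the kind label and the example account table are hypothetical, and the point is only that the mapping policy lives in the replaceable upcall.

```python
NOBODY = 65534  # assumption: the number used when squashing

class MappingCache:
    """Toy stand-in for the kernel keyring: remembers mappings, asks on a miss."""

    def __init__(self, upcall):
        self._upcall = upcall  # plays the role of the nfsidmap executable
        self._entries = {}
        self.upcalls = 0       # counts how often the upcall was needed

    def lookup(self, kind, item):
        key = (kind, item)
        if key not in self._entries:
            self.upcalls += 1
            self._entries[key] = self._upcall(kind, item)
        return self._entries[key]

LOCAL_USERS = {"alice": 1000}  # hypothetical local account list

def straight_or_squash(kind, item):
    """Default-style policy: map known names straight, squash everything else."""
    if kind == "user-name-to-number":
        return LOCAL_USERS.get(item, NOBODY)
    raise ValueError(f"mapping kind not modelled in this sketch: {kind}")

cache = MappingCache(straight_or_squash)
cache.lookup("user-name-to-number", "alice")  # miss: goes through the upcall
cache.lookup("user-name-to-number", "alice")  # hit: answered from the cache
print(cache.upcalls)
```

Swapping `straight_or_squash` for a different function is the analogue of rewriting nfsidmap to get an arbitrary mapping: the cache does not care what policy produced the answer.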

The NFS server Ganesha has some relevant but somewhat underdocumented settings:

If UseGetPwnam is set to true the mapping from NFS v4 names to numbers is done by the standard C library getpwnam() family of functions (that is, via NSS), and these settings become relevant:
- DomainName: the local tag for NFS v4 user names.
If UseGetPwnam is set to false the mapping is done by the nfsidmap library and these settings become relevant:
- IdmapConf is the name of the configuration file for the nfsidmap library, and usually it contains the local NFS v4 domain name, optionally the corresponding Kerberos domain name, and methods to map NFS v4 user names to local user names in the non-Kerberos and Kerberos cases, plus potentially some static mappings.
In both cases it is possible to set Only_Numeric_Owners so that only user numbers are transmitted between servers and clients. Another setting, Allow_Numeric_Owners, permits transmitting both user names and numbers, but user names always take precedence, so it is almost pointless.

2019 August

190827 Tue A weird change in the semantics of Linux linking

In Linux the specifications of link(2) and rename(2) turn out to be different from the traditional UNIX ones in that they fail not only if the destination is in a different filetree, but also if it is in the same filetree but under a different mountpoint.

This causes trouble with software like GNOME: to move a file to the trash the relevant GNOME module only checks whether the file and the user's default trash folder are in the same filesystem, and fails if they are but under different mount points.

Note: as the link above shows, this change dates back at least to 2005, but not only did I miss it, so did the GNOME guys for well over a decade.

The same filetree can appear under different mount points in Linux if it is mounted multiple times (which cannot happen in traditional UNIX) or it is --bind mounted (which is not available in traditional UNIX).
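A more robust approach than comparing filesystems beforehand is to attempt the rename and fall back to copying when it fails with EXDEV, which is the error rename(2) reports in this case. The sketch below assumes regular files only and uses a hypothetical function name; Python's own shutil.move follows a similar rename-then-copy strategy.

```python
import errno
import os
import shutil

def move_file(src, dst):
    """Move a regular file, returning "renamed" or "copied" to show which path ran."""
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as exc:
        # EXDEV covers both a different filesystem and, on Linux, the same
        # filesystem reached through a different mount point.
        if exc.errno != errno.EXDEV:
            raise
    # Fallback: copy contents and metadata, then remove the original.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"
```

The design choice is to let the kernel decide whether the rename is possible instead of guessing from device numbers, since two paths can report the same filesystem and still be refused.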

This limitation is regrettable because it is very convenient to use --bind mounts to decouple physical from logical filetrees.

In the specific case the naming convention was to have home directories under /home/ and a scratch area under /scratch/, but because of local storage limitations both were actually in the same filetree, as /data/local/home/ and /data/local/scratch/, and mounted with --bind to their conventional locations.

2019 July

2019 June