This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
AWS announced today a long term leasing offer that considerably reduces the cost per hour of using their virtual machines. With the new pricing, a commitment to leasing something like an m4.4xlarge instance (16 cores, 64GiB RAM, 2Gb/s storage connection, no storage) would cost $0.3008/h instead of $0.80/h with an Instance Family Savings Plan paid upfront for 3 years, that is just under $8,000 plus taxes, or $16,000 plus taxes for 6 years if it is kept on an extended depreciation period. A really nice server like that costs less than $2,000 plus taxes to buy outright. AWS is in effect charging over $2,000 per year plus taxes to host it and give an availability guarantee, when a colocation service may charge around $1,000 a year (but they do not guarantee availability of the server itself) for a server that cost $2,000 plus taxes to buy.
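The arithmetic above is easy to check; a minimal sketch, using only the hourly rates quoted in this post:

```python
# Rough cost comparison between an AWS 3-year savings plan and buying
# a comparable server outright. Rates are those quoted above.

HOURS_PER_YEAR = 24 * 365

on_demand_rate = 0.80     # $/h, m4.4xlarge on demand
savings_rate = 0.3008     # $/h, 3-year Instance Family Savings Plan, paid upfront

three_year_total = savings_rate * HOURS_PER_YEAR * 3
print(f"3-year prepaid total: ${three_year_total:,.0f}")  # just under $8,000

server_price = 2000       # $ to buy a really nice server outright
hosting_premium = (three_year_total - server_price) / 3
print(f"effective hosting premium: ${hosting_premium:,.0f}/year")  # about $2,000/year
```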
What AWS have done with their new pricing option is to strip away most of the value of the "flexibility" argument for AWS, because buying 3 years of virtual machine time up front is pretty inflexible. They do offer some flexibility in that the prepaid time applies to any instance in a family, but for any customer with a small or large fleet of systems that is pretty much irrelevant.
Therefore the only rationales for using a service like AWS seem to me to be:
I have been using the NFS Ganesha server for a while and recently I discovered a significant issue with it and contemporary Ubuntu kernels (4.4.0 for Ubuntu 16 and 4.15.0 for Ubuntu 18): it somehow formats responses to the Linux kernel NFSv4 client in such a way that, even though correct, they trigger a bug in the NFS client that makes it ignore some entries in the responses to the READDIR and READDIRPLUS operations.
This means for example that rm -rf DIR usually does not remove all files under DIR, nor DIR itself, but gives an error claiming that the directory is not empty: that's because rm first lists all the entries in that directory, and since some entries are ignored, they are not deleted.
This seems to be an issue with those Ubuntu kernels, as it does not happen under Fedora 30 with a 5.2 series kernel or under Ubuntu LTS 18 with a 5.0 kernel, and does not happen with the Linux nfs-kernel-server implementation, which I think is not as flexible and maintainable as NFS Ganesha.
Note: the client side bug seems to have been fixed in 4.17, and it depends also on the value of rsize: the larger it is, the less often it happens. A particular example with a directory of 90 entries:
$ uname -a; for N in 1024 2048 4096 6144 8192 12288 16384 32768 65536; do echo -n "$N => "; sudo mount -t nfs -o rw,vers=4,proto=tcp,timeo=10,intr,rsize=$N azara:/scratch /mnt/tmp && ls /mnt/tmp/test | wc -l; sudo umount /mnt/tmp; done
Linux noether 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
1024 => 90
2048 => 81
4096 => 86
6144 => 86
8192 => 88
12288 => 88
16384 => 90
32768 => 90
65536 => 90
In UNIX systems users have both a name and a number, and it is the number that is used for authorization even if it is the name that is used for authentication (several user accounts can share a user number but have different passwords). Each UNIX system instance has its own user name space and user number space, which are incommensurable with those of other UNIX system instances.
Note: here user means either user or group, as they are handled similarly in UNIX systems; user number means uid and group number means gid.
This means that a user name or number can have completely different meanings on different UNIX instances, for example correspond to different people.
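The per-instance nature of the mapping is easy to see from the local account database; the round trip below is only meaningful on the machine it runs on (it assumes only that some account with uid 0 exists, as on practically every UNIX system):

```python
# Look up the local mapping between user names and user numbers.
# These mappings come from this system's own account database, so
# another UNIX instance may map the same uid to a different name
# (i.e. a different person).
import pwd

entry = pwd.getpwuid(0)      # the account with uid 0 on *this* system
print(entry.pw_name)         # conventionally "root", but that is only a convention

# The number -> name -> number round trip only holds locally.
assert pwd.getpwnam(entry.pw_name).pw_uid == 0
```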
This poses a problem both for transferring filetrees between UNIX system instances on media, and for sharing them over a network: the only sensible action by default is to pretend that the foreign filetree is owned by a special user name and user number that is presumed meaningless.
However it is possible, at a site with a central system administration, to assign the same meanings to the same user names, or the same user numbers, or to different ones, by convention.
In that case NFS v3 and v4 support those conventions, in a way that is somewhat peculiar and not entirely well documented; for NFS v3 it is quite simple:
For NFS v4 there are four different options:
domain, which is just a string, with the convention that usernames tagged with the same string are in the same namespace, that is commensurable, while those tagged with different domains are incommensurable.
In all these options, commonly available implementations don't allow any specific mapping, for example from remote tagged username V@B to local tagged username U@A.
In practice the only useful options are to have all usernames on a site tagged with the same domain, or to share all usernumbers on a site, with a central repository of account names or numbers; usually that makes it easy to have the same user names and user numbers everywhere.
Note: in order to support use of both NFS v3 and v4 it might seem more important to share the same meaning for user numbers, but if Kerberos is used (as it should be) to authenticate NFS users, then since it uses only names, names should also be the same. Many NFS v4 Kerberos subsystems allow arbitrary mappings between Kerberos user names and UNIX user names, but that ideally should be avoided.
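For example, on typical Linux systems the NFSv4 domain tag is set in /etc/idmapd.conf, which must agree between client and server (the domain example.com and the nobody/nogroup fallback accounts below are placeholders, and the fallback account names vary by distribution):

```
# /etc/idmapd.conf -- must carry the same Domain on client and server
[General]
Domain = example.com

[Mapping]
# fallback accounts for names that cannot be mapped
Nobody-User = nobody
Nobody-Group = nogroup
```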
The Linux NFS kernel client uses the nfsidmap executable as the default mapping method, the NFS kernel server uses the rpc.idmapd daemon, and both use the nfsidmap library; the NFS Ganesha server uses that library directly.
There are some interesting non-obvious details about the nfsidmap executable and library:
The kernel NFS client caches mappings in a kernel keyring, and if a mapping is uncached it uses an upcall to the nfsidmap executable (linked to the library), passing it the item (name or number) to map and the identifier of the keyring to which to add the mapping.
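The upcall goes through the kernel's key request mechanism; on typical Linux systems /etc/request-key.conf (or a file under /etc/request-key.d/) contains a line like the following, where the exact path to nfsidmap varies by distribution:

```
# keys of type "id_resolver" are resolved by running the nfsidmap
# executable; %k is the key id, %d the description (the name or
# number to map), and -t 600 caches the result for 600 seconds
create  id_resolver  *  *  /usr/sbin/nfsidmap -t 600 %k %d
```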
squashing.
The NFS server Ganesha has some relevant but somewhat underdocumented settings:
In Linux the specifications of link(2) and rename(2) turn out to be different from the traditional UNIX ones: they fail not only if the destination is in a different filetree, but also if it is in the same filetree but under a different mount point.
This causes trouble with software like GNOME: to move a file to the trash, the relevant GNOME module only checks whether the file and the user's default trash folder are in the same filesystem, and fails if they are but appear under different mount points.
Note: as the link above shows this change dates back at least to 2005, but not only did I miss it, so did the GNOME guys for well over a decade.
The same filetree can appear under different mount points in Linux if it is mounted multiple times (which cannot happen in traditional UNIX) or it is --bind mounted (which is not available in traditional UNIX).
This limitation is regrettable because it is very convenient to use --bind mounts to decouple physical from logical filetrees.
In the specific case the naming convention was to have home directories under /home/ and a scratch area under /scratch/, but because of local storage limitations both were actually in the same filetree, at /data/local/home/ and /data/local/scratch/, and mounted with --bind at their conventional locations.
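The user-space workaround is to treat the EXDEV error as "copy then delete" rather than as a failure, which is what mv(1) does. A minimal sketch (safe_move is a hypothetical helper, not GNOME's actual code):

```python
# Move a file, falling back to copy+unlink when rename(2) fails with
# EXDEV -- which on Linux happens not only across filetrees but also
# across --bind mount points of the *same* filetree.
import errno
import os
import shutil

def safe_move(src: str, dst: str) -> None:
    try:
        os.rename(src, dst)          # cheap path: same mount point
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        # Different mount point (possibly the same filetree seen via
        # --bind): copy data and metadata, then remove the original.
        shutil.copy2(src, dst)
        os.unlink(src)
```

shutil.move in the Python standard library implements essentially this fallback, which is why it keeps working across bind mounts where a bare os.rename does not.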