Computing notes 2016 part two

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.

161229 Thu: Bookmarks with 'org' mode

The WWW is a wonderful if messy library and I keep lists of hyperlinks (bookmarks) to notable parts of it, currently around 6,000. I have switched the tool I use to keep these lists a few times:

The main problems were actually similar across these three GUI bookmark managers:

So I started looking for text-based bookmark list managers, and the first I found and tried was Buku. It is a purely command-line bookmark database, and does not organize bookmarks by list, but by keywords. It works well, but that is not quite what I wanted. So I thought about what I wanted and I realized that I really wanted a nested list manager, and that an outline-oriented editor could be used in that way.

Then I remembered that there is a mode for outline editing in EMACS, that EMACS can open files or hyperlinks in the middle of text, and that there was a further evolution of outline editing with added functionality for maintaining lists of notes, which is org mode. I reckon that in general outline editing is not that useful, but it occurred to me that it might be a good match for looking at and editing nested lists.

Org mode is an extension of outline editing mode that incorporates a kind of markdown and makes more operations available, among them easy ways of sorting lists, moving groups of list items, and displaying only the entries that match a regular expression, so I started using it.
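
As a rough illustration (the entries, tags and URLs here are just made-up examples, not my actual list), a nested bookmark list in org syntax might look like:

* Storage
** Filesystems
*** [[https://btrfs.wiki.kernel.org/][Btrfs wiki]]          :linux:checksums:
*** [[https://nilfs.sourceforge.io/][NILFS2 home page]]     :linux:snapshots:
* Networking
** IPv6
*** [[https://www.ietf.org/][IETF home page]]             :standards: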

It is a lot better than the GUI-based list managers I have tried so far for managing lists of bookmarks, and even citations. In particular it is much, much faster, every operation being essentially instantaneous, as it should be on a few hundred KiB of data, and very convenient to use, and I have finally been able to re-sort, tag and update my bookmark collection. Since EMACS also runs in pure text mode, it works on the command line in full screen mode too. Plus of course, since it is built inside a full-featured text editor, it has all the power of the text editing tools available.

While I am happy that I found a good solution in my old EMACS, I am also sad, because I usually find applications built within EMACS, like the email reader/writer VM and the SGML/XML structure editor PSGML, to be preferable to dedicated tools. I think that this is because of some fairly fundamental issues:

So org mode is working well for me for bookmark lists, and I guess that it would work well for keeping other types of lists. There are indeed a number of org mode enthusiasts (for example 1, 2) who do nearly everything with org mode, as nearly anything can be turned into a list of items, but I haven't reached that stage yet.

161217 Sat: The most interesting filesystem types

One of the more interesting aspects of Linux (the kernel) is the number of filesystem designs that have been added to it. In part I think this is because it started with the Minix filesystem, which had a number of limitations, then someone implemented xiafs, which also had some limitations, and then things started to snowball as everybody tried to write better alternatives.

Among the current filesystem designs there are some that are generic UNIX-style filesystems and some that are rather specialized as to features or purpose, or not very UNIX-like. Having tried many, the ones that I like most are:

161206 Tue: GRUB2 and Linux console video modes

When GRUB2 sets up the system console, if it is graphical, it is set to some video mode (resolution and depth); the loaded Linux kernel, when it starts, sets one too, and the two are not necessarily the same.

Usually the defaults are suggested by the BIOS (if it is an IA system), or by the DDC information retrieved from the monitor. Sometimes the defaults are not appropriate or available, so they must be set explicitly.

Having had a look at the GRUB2 and Linux kernel documentation and done some experiments, I discovered that as usual the documentation is somewhat reticent and misleading, and the more reliable and complete story is:

I have found that good defaults are 1024x768x32,1024x768,auto for gfxmode and 1024x768-32 for video: virtually any graphics card supports them, virtually all monitors have that resolution or higher, and that resolution allows for a decently sized console, around 128x38 characters with typical fonts.
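
One way to set these, assuming a Debian-style /etc/default/grub (a minimal sketch; the configuration then needs to be regenerated with update-grub or grub-mkconfig):

# /etc/default/grub (sketch)
GRUB_GFXMODE="1024x768x32,1024x768,auto"
# passed to the kernel for the framebuffer console
GRUB_CMDLINE_LINUX_DEFAULT="video=1024x768-32"
# optionally make the kernel keep GRUB's mode instead
#GRUB_GFXPAYLOAD_LINUX="keep"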

161126 Sat: Homogeneous systems and heterogeneous workloads

Having pondered the match (or rather more often the mismatch) between system and workload performance envelopes I found in an article a pithy way to express this in the common case:

A homogeneous system cannot be optimal for a heterogeneous workload.

My work would often have been a lot more enjoyable if my predecessors had set up systems accordingly. But the only case where a homogeneous cluster can run a heterogeneous workload well is when it is oversized for that workload, where the performance envelope of the workload is entirely contained within that of the system; consequently first impressions of a newly set up homogeneous cluster can be positive for a while, only to evolve into disappointment as the cluster workload nears capacity (or worse, when it enters a post-saturation regime).

The author of the presentation in which the quote is contained has done a lot of interesting work on latency, which is usually the limiting factor in homogeneous systems, and particularly on memory and interconnect latency and bandwidth. Since my usual workloads tend to be analysis rather than simulation workloads, my interest has been particularly in network and storage latency (and bandwidth), which is possibly an even worse issue.

161124 Sat: Specifications of a 1990 UNIX System V workstation

I was quite amused by looking at an old mailing list post from March 1990 with advice on buying a UNIX system, suggesting a 386SX 16/32b processor at 20MHz with 4MiB of RAM, a 40MB-80MB disk, and the UNIX System V versions Xenix or ESIX, with prices in either UK pounds or USA dollars:

Best configuration is a 386SX, 4 Megs, 1/2 discs either RLL, ESDI or SCSI, for a total of 80-120 megabytes, a VGA card with a 14" monitor, and a QIC-24 tape drive. Software should be either Xenix 386 (more efficient) or ESIX rev. D (cheaper, fast file system).

Here are UK and USA prices for all the above:

	Xenix 386 complete		#905		$1050
	Xenix TCP/IP			#350

	ESIX complete					$800

	386SX with 1 meg		#660		$900
	additional 3 meg		#240		$300

	VGA 16 bit to 800x600		#100		$190
	VGA 16 bit to 1024x768		#150		$250

	VGA mono 14"			#110		$240
	VGA color 14"			#260		
	VGA color 14" multisync		#300		$440
	VGA color 14" to 1024x768	#450		$590

	1 RLL controller		#100		$140
	1 RLL disc 28 ms 40/60 meg	#260		$380
	1 RLL disc 25 ms 40/60 meg	#300		
	1 RLL disc 22 ms 71/100 meg	#450		$570
	1 RLL disc 28 ms 80/120 meg	#520		$570

	1 ESDI controller		#170		$200
	1 ESDI disc 18 ms 150 meg	#660		$1200

	Epson SQ 850			#490
	Desk Jet +			#510		$670
	LaserJet IIP			#810		$990
	Postscript for IIP		#545		$450

	Archive 60 meg tape		#420		$580

Note: given inflation roughly double the prices above to get somewhat equivalent 2016 prices.

That's just over 25 years ago, and it is amusing to me because:

161112 Sat: Old ADSL modem-routers still working well

Recently I changed ISP so that my previously current O2 Wireless Box V stopped being suitable, and I tried two other ADSL modem-routers that I had kept from years ago:

Note: in a previous post I surmised that the Vigor 2800 does not support 6to4 IPv4 packets. That was a mistake: it does allow them through, it just does not do NAT on the encapsulated IPv6 packet, and this just requires a slightly different setup.

It is somewhat remarkable that 12 and 10 year old electronics still work. They have not been used for half of that time, but the age is still there. I guess it helps that they have no moving parts.

It is more remarkable that they are still usable, and connect at the top speeds available on regular consumer broadband lines even today, the same as 10-12 years ago. Current FTTC lines with VDSL2 and cable lines can do higher speeds, but they cost more and they have required extensive recabling. It looks like ADSL2+ is about as good as it will ever get on plain copper lines, which will remain stuck in the 2000s.

161106 Sun: Auditing and "cloud" virtual machines

Recently, while discussing "cloud" virtual machine hosting options for a small business, a smart person mentioned that other customers of the fairly good GNU/Linux virtual machine hosting business he was using were perplexed by the hosting provider's default (which can be opted out of) of keeping root access to the customers' virtual machines.

My comment on that is that many hosting providers realize that most of their customers are not particularly computing literate and thus can be forgetful of bad situations in their virtual machines, such as security breaches, that can cause trouble to other customers or to the hosting provider itself, and that in such situations a hosting provider has a number of options, listed in order of increasing interventionism:

  1. Notify the customer.
  2. Suggest to the customer remedial action.
  3. Give a deadline to the customer for remedial action.
  4. Intervene directly inside the VM with root access to fix the issue.
  5. Block network access to/from the VM.
  6. Stop the VM entirely.

Many issues can be easily fixed by direct intervention inside the VM, which solves the issue from the point of view of the hosting provider without impacting the availability of the customer's system.

The reply to that was that some customers were reluctant to leave the default root access to a third party in place, for privacy or commercial confidentiality reasons, or for auditability. As to this I pointed out that a virtual machine hosting provider (that is, any of their employees with access, including those working covertly for other organizations) has in any case complete control over and access into a customer's virtual machines, including any encrypted storage or memory, as they can snapshot and observe or modify any aspect of the virtual machine's state, essentially invisibly to any auditing tools inside the virtual machine, so direct root access is just a convenience for the benefit of the customer.

Note: in a sense a system hosting VMs is a giant all-powerful backdoor to all the VMs it hosts.

Of course, given the complete access that "cloud" hosting providers have to the VMs and data of their customers, only a rather incompetent or underfunded intelligence gathering organization would refrain from infiltrating as many popular hosting providers as possible with loyal maintenance (building, hardware, software) engineers, as most intelligence gathering organizations have mandates from their sponsors to engage in pervasive industrial espionage, at the very least.

Note: in the case of a hosting provider manager somehow discovering that one of their engineers has been accessing customer VMs, their strongest incentive is to maintain customer confidence either by doing nothing or by terminating the engineer's contract with excellent benefits and a confidentiality agreement, without investigating the hosting systems to remedy the situation, because such an investigation might cost a lot of money for no customer-visible, never mind customer-invoiceable, benefit.

To which someone else pointed out that in some countries (Germany was mentioned) many businesses keep their own physical systems hosted on their own premises, with suitable Internet links. My comment to this is that this is understandable: even hosted physical systems can be compromised, even if less easily than VMs. Systems purchased and installed by the customer in cages at a hosting provider can also be compromised, even if less easily still. For moderately confidential business data co-located customer-purchased and installed systems in a cage with a mechanical padlock purchased and installed by the customer (cage locks controlled by the hosting provider are obviously not a problem for insiders) seems good enough, but protecting more confidential business data requires keeping the systems on-premises to remove a further class of potential and unauditable access.

That "cloud" VMs (or physical systems) can be compromised at will by hosting provider engineers in an unauditable way is perhaps not commonly considered by their customers with even moderately sensitive data.

Note: Hopefully no bank or medical data are processed on a "cloud" hosted system.

Note: Well encrypted data held on a "cloud" storage system is of course not an issue. But most "cloud" storage vendors make it quite expensive to access data from outside their "cloud", and that often involves significant latency anyhow, so usually "cloud" storage is accessed by systems hosted in the same "cloud" (usually the same region of it).

What "cloud" hosting providers seem most useful for (1, 2) as to both cost and confidentiality is to provide a large redundant international CDN for nearly-public data.

161103 Thu: IPv6 6to4 with NAT

I have mentioned previously how easy (but subtle) it is to set up a 6to4 tunnel (1, 2, 3, 4) but I left the issue of NAT with 6to4, which is unfortunately very common for home users, a bit vague.

As 6to4 encapsulates an IPv6 packet inside an IPv4 packet there are two possible layers of NAT, on the containing IPv4 addresses and on the contained IPv6 addresses, and a limitation: since the encapsulating IPv4 packets carry no port numbers, the internal-to-external address mapping of NAT must be one-to-one. If IPv6 were encapsulated in an IPv4 UDP datagram, the port number in the UDP header could be used to map multiple internal addresses to one external address.

This means that there can only be as many 6to4 systems on an internal LAN as there are external addresses available to the NAT gateway, and often that is a single one. The rest of the discussion is based on assuming that there is a single external address.

It is still possible to have many internal systems on IPv6 as long as they route via the 6to4 system and the internal IPv6 subnet is chosen appropriately. The latter is not difficult, as each IPv4 address defines a very large 6to4 subnet: the first 16 bits of the prefix are the 6to4 prefix 2002::/16, and the next 32 bits contain the IPv4 address, giving a /48 prefix per IPv4 address. So for example:
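
Taking the documentation address 198.51.100.1 as a placeholder, the corresponding /48 can be derived by writing each octet in hexadecimal:

# 198 = c6, 51 = 33, 100 = 64, 1 = 01
printf '2002:%02x%02x:%02x%02x::/48\n' 198 51 100 1
# prints 2002:c633:6401::/48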

The first two issues are whether the NAT gateway allows passing IPv4 packets with a protocol type of 41 (IPv6) at all, and whether it performs NAT on them. Both must be possible to use 6to4 in a NAT'ed subnet.

The goal then is to ensure that when a 6to4 packet exits the NAT gateway it has these properties:

The last point is the really important one: when replying, the target IPv6 system knows (because of the 2002::/16 prefix) that the reply IPv6 packet must be encapsulated in an IPv4 packet or sent to a 6to4 gateway, and that the target address of the IPv4 packet is the external address of the NAT gateway.

Therefore the next issues are whether the NAT gateway also NATs the contained IPv6 addresses, and whether the external address is static or dynamic:

So in practice the two simple cases are to use an internal IPv6 prefix based on the static external IPv4 address of the NAT gateway, or, if the NAT gateway can also map the contained IPv6 source addresses, to use a prefix based on the internal IPv4 address of the 6to4 gateway.

Two full examples based on the following values from the example above, assuming the IPv4 configuration is already done:
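
As an illustrative sketch only (the addresses continue the 198.51.100.1 placeholder and its 2002:c633:6401::/48 prefix rather than any real values), the ip commands for the simple case of a static external address might be roughly:

ip tunnel add tun6to4 mode sit remote any local 198.51.100.1 ttl 64
ip link set dev tun6to4 up
# an address inside the /48 derived from the external IPv4 address
ip -6 addr add 2002:c633:6401::1/16 dev tun6to4
# everything else via the anycast 6to4 relay
ip -6 route add 2000::/3 via ::192.88.99.1 dev tun6to4

On a 6to4 gateway sitting behind the NAT box, the local address in the tunnel command would instead be its internal IPv4 address, relying on the NAT gateway to pass and translate the protocol 41 packets as discussed above.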

The same in abbreviated form for Debian's /etc/network/interfaces:
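
A hedged sketch, using the ifupdown v4tunnel method and the same placeholder addresses, might look roughly like:

auto tun6to4
iface tun6to4 inet6 v4tunnel
    address 2002:c633:6401::1
    netmask 16
    local 198.51.100.1
    endpoint any
    gateway ::192.88.99.1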

161009 Sun: Backup drives with NILFS2 filesystems

Currently I am using mostly Btrfs for my home computer filetrees, mainly because of its extensive support for data checksums. My setup also pairs the active disk drive with a backup one, where the second is synchronized via rsync to the original every night, and then I also do periodic manual backups to an external drive or two.
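
A hedged sketch of such a nightly synchronization (paths and the device name are illustrative placeholders) might be:

# mirror the active filetree onto the backup drive
rsync -aHAXx --delete /home/ /mnt/backup/home/
# on a NILFS2 backup drive, pin that night's state as a snapshot (nilfs-utils)
mkcp -s /dev/sdb2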

I have recently chosen to use NILFS2 for the backup drives, for these main reasons:

161008 Sat: The "spines-leaves" topologies and crossbar switching

Among others, Google have adopted (the main part of) the purely-routed network topology that I described a few years ago and that I had implemented earlier at a largish science site, which is based on routers (and servers) being multihomed on multiple otherwise unconnected backbones, with connectivity managed by OSPF and ECMP, and using /32 routes to provide topology independent addresses for services.

The Google paper and other descriptions (1, 2, 3) refer to them as leaf-spine (or more properly spines-leaves) topologies and claim that they are crossbar-switch topologies like those introduced by Charles Clos quite some time ago to interlink several crossbar switches into a whole that performed almost like a single large crossbar. These topologies have also long been known as multi-stage distribution topologies and typically have 3 stages.

The multiple independent backbone spines-leaves topologies I have used are not quite Clos-style networks; the shape is different, as Clos topologies need at least three stages, while spines-leaves topologies are as a rule on two levels; also Clos topologies are pass-through, while spines-leaves topologies connect endpoints among themselves. The main common point is that the second stage or level as a rule involves fewer switches than the first level.

Note: There is thus a semi-plausible argument that spines-leaves topologies are degenerate Clos topologies where the first and third stage switches are the same, but I don't like it.

The main point is not however about the different shape, but the rather different purposes:

The shape of spines-leaves topologies is driven by their primary goal which is resilience achieved by:

Note: the use of /32 host routes on virtual interfaces might allow using entirely LAN-local dynamically allocated addresses (such as the 169.254.0.0/16 IPv4 range) for link-interface addresses.

This delivers resilience which is scalable by adding more spine/backbone routers and/or more leaf/local routers; as to the latter, servers (and even clients) can well be multiple-homed themselves onto multiple routers, and running an OSPF daemon on a host is a fairly trivial task. I even had an OSPF daemon running on my laptop. In my largish science installation I had servers that were critical to overall resilience and performance multiple-homed directly on the spine/backbone routers, and those important to specific areas on the leaf/local routers.
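
As a hedged sketch (the addresses and the choice of Quagga are made up for illustration), the ospfd.conf of such a multihomed server might contain little more than:

! two interfaces onto two otherwise unconnected backbones
router ospf
 ospf router-id 10.255.0.10
 network 10.1.0.0/24 area 0.0.0.0
 network 10.2.0.0/24 area 0.0.0.0
 ! the /32 service address configured on a loopback or dummy interface
 network 10.255.0.10/32 area 0.0.0.0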

The spines-leaves topology is not designed to deliver full symmetrical N-to-N trunking of circuits like Clos topologies because client-server packet flows are not at all like that. However it also gives nice scalable capacity and latency options: adding more spine/backbone or leaf/local routers and thus links and switching capacity to the whole, again largely transparently thanks to OSPF/ECMP routing and /32 host routes.

Finally, spines-leaves topologies are very maintainable: because of their dynamic and scalable resilience and capacity it is possible to overprovision them, which allows taking down small or large parts of the infrastructure for maintenance quite transparently, while the rest of the infrastructure continues to provide service. It is also important for maintainability that the router configurations end up being trivially simple:

161006 Mon: Cluster post-saturation regime and disk based storage

A famous principle of system optimization is that removing the main performance limiter exposes the next one. So faster CPUs make memory the next performance limiter, and so on.

But another important detail is that what is a performance limiter depends critically on the match or mismatch between the workload and the performance envelope of each system component.

In particular many system components, like disks, have highly anisotropic performance envelopes, so that their effective capacity is highly dependent on workload.

In particular capacity can be highly dependent not just on the profile of the workload, but also on its size, because effective component capacity often depends nonlinearly on its utilization.

As an example, consider a typical parallel batch processing cluster of around 20 systems, with a total capacity of around 600 threads and 200 large disk drives, with around 100 threads devoted to background daemons and 500 available for parallel jobs. Each of the disk drives can do around 100MB/s when accessed by a single thread, or as little as 0.5MB/s per thread if used by many threads.

In such a cluster total capacity is at some point inversely proportional to the number of running threads that do IO, even if that IO is sequential, because the more threads do IO on the same disk, the lower the total transfer rate that the disk can deliver.

It must be emphasized that what decreases at some point is the total capacity of the cluster, not merely per-thread resources; the per-thread resources will at some point fall more than linearly.

Therefore suppose that overall 1.5 threads can share a disk given their average transfer rate and IOPS requirements: then if more than 200 threads are added to the 100 background threads, total cluster capacity will fall, because each disk will deliver a lower, possibly a much lower, aggregate transfer rate. This happens even though 300 total threads are well below the 600-thread total capacity of the cluster.

Sometimes this situation is called a post-saturation regime, where in the example above cluster saturation is reached with 300 threads. Once that regime is reached additional load will further reduce capacity, and total time to complete may well be longer than with sequential execution. In an extreme case I saw many years ago, running two jobs in parallel took twice as long as running them in series. In the example above running an additional 400 threads, reaching 100% CPU occupancy, will result in a total completion time probably rather higher than running two sets of 200 threads serially.

Note: for virtual memory the same situation is called thrashing, and outside computing it often leads to gridlock: imagine a city that has a saturation point of 200,000 cars: with 250,000 cars traffic speed will slow down enormously, and with 300,000 cars traffic speed will probably be close to zero. For car traffic the resources whose total capacity reduces past the saturation point are intersections, as cars must stop and start again, and do so the more often the more cars use them.

In a similar situation an observant user asked me to configure the cluster job scheduler to limit the number of concurrent threads to the number that would not trigger a post-saturation regime on the disks. Most cluster job schedulers can be configured with user-defined consumable resources, for example in the case of SGE as complexes defined using qconf -Mc and then assigned as complex_values on hosts or queues.

For the example above, of the 200 total disks the capacity of around 50-70 is consumed by the background threads, and one can define complex values for the total IOPS and total sequential transfer rate available, to accommodate jobs with mostly random or mostly sequential access patterns, as follows, given the remaining capacity of 140 disks:

complex_values IOPS=1300,MBPS=1300
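
A hedged sketch of how this might be wired up in SGE (the exact column values are illustrative): the two consumables would first be defined in the complex configuration loaded with qconf -Mc, roughly as

#name  shortcut  type  relop  requestable  consumable  default  urgency
IOPS   iops      INT   <=     YES          YES         0        0
MBPS   mbps      INT   <=     YES          YES         0        0

and jobs would then declare their expected IO consumption when submitted, for example with qsub -l IOPS=20,MBPS=40 job.sh, the scheduler keeping the running total within the complex_values limit above.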

That is actually a bit optimistic because we have assumed that 200 threads can saturate on average 200/1.5 => 133 disks, but if there are hot spots, because of a non-uniform distribution of IO from threads to disks, the post-saturation regime can happen with fewer than 200 threads.

Note: that configuration is even more optimistic because it gives as capacity the raw physical capacity of the 200 example disks. The actual capacity in IOPS and MBPS available to user threads can be significantly lower if there are layers of virtualization and replication in the IO subsystem, for example if the jobs run in virtual machines over a SAN or NAS storage layer.

When I configured consumables for IOPS and MBPS for the cluster as requested by that observant user, I got complaints from other users and management that since this limited the number of concurrent threads it limited the utilization of the cluster. But the relevant utilization was that of the main performance limiter, in that case disk capacity, and ignoring it and considering the number of available thread slots instead would overload disk capacity, thus achieving lower overall utilization; it seemed as if the number of thread slots occupied was considered more important than the number of thread slots usefully utilized, so I had to make the specification of those consumable resources optional, which rather defeated their purpose. But the experienced user who had made the request continued to use them, and his cluster jobs tended to run at times when nothing else was running, so at least he benefited.

Note: the capacity reduction in a post-saturation regime on disk is due to a transition from sequential accesses with a per-disk transfer rate of 100-150MB/s to interleaved (and thus random-ish) accesses with a per-disk transfer rate that can be as low as 0.5-1MB/s. Seek times are for disks the equivalent of stopping and starting at intersections for cars in a city, that is periods of time in which useful activity is not performed, and which increase in frequency or duration when the load goes up.

160924 Sat: The Linux auditing subsystem and limits to auditing

I have been testing for a while a configuration for the Linux auditing subsystem (which seems to me awfully designed and implemented) to monitor changes in system files, and when I looked some time ago at the included rules examples I was left somewhat amused. The examples are mostly about reporting modification or access for a few critical system files. That is somewhat weak: as per my testing one also has to check all the executables and libraries, because every modified executable can misbehave.

So my experiments have been about adding the major system library and executable directories to be monitored by the auditing subsystem, as a kind of continuous integrity checking, instead of the periodic checking done by checksum-based integrity checkers or by periodic snapshots using filesystems like NILFS2 or Btrfs.
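
A hedged sketch of what such watch rules might look like (the directory choices are illustrative; on most distributions they would go in a file under /etc/audit/rules.d/ or be loaded with auditctl):

# watch writes and attribute changes to executables and libraries
-w /usr/bin -p wa -k binaries
-w /usr/sbin -p wa -k binaries
-w /usr/lib -p wa -k libraries
# and to system configuration
-w /etc -p wa -k sysconfig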

The major functionality of the audit subsystem is indeed to monitor file accesses, using in-kernel access to the inotify subsystem; but it can also monitor the use of system entry points, which is unfortunately largely pointless, as there are very many system entry point calls on a running system, and the audit subsystem does not allow fine-grained filtering of which specific calls to a system entry point to monitor.

These are reasonable just-in-case measures, but they are not very effective against a well-funded adversary: every library or program in a system or the system hardware can be compromised at the source.

As to libraries and programs, given that the main technique of major security services is to infiltrate other organizations, it is hard to imagine that they would not get affiliated engineers hired by major software companies, or volunteer them to free software projects, so that they could add to the code some backdoors cleverly disguised as subtle bugs. It is hard to guess how many of the security issues in software that get regularly discovered and patched were genuine mistakes or carefully designed ones. But I suspect that the most valuable backdoors are very well disguised and hard to find, probably triggered only by obscure combinations of behaviours.

Probably companies like Microsoft, Google, Redhat, Facebook, SAP, Apple, EA, etc., have had (unknowingly) for a long time dozens if not hundreds of engineers affiliated with the security services (or large criminal organizations) of most major countries (India, UK, China, Israel, Russia, ...). Indeed I would be outraged if the security services funded by my taxes were not doing that.

As to hardware probably there are also engineers affiliated with various security services (or large criminal organizations) in virtually all major hardware design companies, Intel, Apple, ABB, CISCO, etc., where they can also insert into the design of a CPU or another chip or the firmware of peripherals like disks or printers some backdoors cleverly disguised as design mistakes.

But, as the files disclosed by Edward Snowden about the activities of the NSA in the USA show, hardware can be compromised at the product level, not just at the component level: security agencies can afford the expense of intercepting shipping boxes and inserting in them surveillance devices or backdoored parts, or of entering premises hosting already installed products and doing the same, for example replacing USB cables with (rather expensive) identical looking ones containing radios.

Note: a former coworker mentioned reading an article that showed how easy it is to put a USB logger, that is a keylogger, in a computer mouse, or any other peripheral, and there are examples of far more sophisticated and surprising techniques.

Fully auditing third party libraries and programs (whether proprietary or free software), component hardware designs, boxed hardware, and installed hardware is in practice impossible or too difficult, and anyhow very expensive as far as it can go. For serious security requirements the only choice is to make use only of hardware and software that has been developed entirely (down to the cabling, consider again USB cables with built in radios) by fully trusted parties, that is full auditing of the source; for example the government of China have funded the native development of a CPU based on the MIPS instruction set, and no doubt also of compilers, libraries, operating systems entirely natively developed. Probably most other major governments have done the same.

For system administrators in ordinary places the bad news is that they cannot afford to do anything like auditing at the source, and therefore every hardware and software component must be presumed compromised; the good news is that the systems administered are usually rather unlikely to be of such value as to attract the attention of major security services or to be regarded by them as deserving the risk and expense of making use of the more advanced backdoors, or of using NSA-style field teams to intercept hardware being delivered or to modify in place hardware already installed. Also probably for ordinary installations a degree of physical isolation is sufficient to make effective use of the less advanced backdoors too difficult in most cases.

Therefore for ordinary installations the Linux audit subsystem is moderately useful, together with other similar measures, as long as it does not give a feeling of security beyond its limits.

It is also useful for system troubleshooting and profiling, as it can give interesting information on the actual system usage of processes and applications, complementing the strace(1) and inotifywatch(1) user level tools and the SystemTap kernel subsystem.

160907 Wed: Something new in recent times: Arvados Keep

Today in a discussion the topic of what is new and recent in systems came up. In general not much that is new has happened in systems design for a while, never mind recently. Things like IPv6, Linux, even flash SSDs feel new, but they are relatively old. Many other recent developments are not new, but rediscoveries of older designs that had gone out of practice as tradeoffs changed, and have come back into practice as they changed back.

After a bit of thinking I mentioned a distributed filesystem, Arvados Keep, because it contains a genuinely different design feature that needs some explaining.

As mentioned in numerous previous posts, designing scalable filesystems is hard, especially when they are distributed. Scaling data performance envelopes is relatively easy in some aspects, for example by using parallelism as in RAID to scale up throughput for mass data access; the difficulty is scaling up metadata operations, where metadata includes both file attributes and internal data structures. The difficulty arises from metadata being highly structured and interlinked, which means that mass metadata operations, like integrity auditing or indexing or backups, tend to happen sequentially.

Arvados Keep is a distributed filesystem which has two sharply distinct layers:

The distinctive characteristic is that there is no metadata that lists which hash (that is, which segment) is on which server and storage device.

Since there is no metadata for segment location, and each segment identifier uniquely identifies the content of the segment, whole-metadata checks can be parallelized quite easily: each storage server can enumerate in parallel all the segments it has, and then check the integrity of the content against the hash; at the same time the file naming layer does its own integrity checks in parallel. Periodically a garbage collector looks at the file-naming database, queries in parallel the storage servers, and reconciles the lists of hashes for the files with those available on the storage backends, deleting the segments that are not referenced by any file.

That is quite new and relatively recent, and helps a lot with scalability, which has been a problem for quite a while.

The absence of metadata linking explicitly files to storage segments is made possible by the use of content addressing for the segments, and it is that which makes parallelism possible in metadata scans. It has however a downside: that locating a segment requires indeed content addressing via the segment identifier. That potentially means a scan of all storage servers and devices every time a segment needs to be located.

That could be improved in several ways, for example via a hint database, or by using multicasting. The current way used by Arvados Keep is to first calculate a likely location based on the number of servers and to check that first, falling back to a linear scan if the segment identifier is not found on that server.

That works surprisingly well, in part because segment identifiers are essentially random numbers, and the same calculation is of course used when creating a new segment and when looking it up. The calculation is also highly scalable: it takes the same time whether the segments are distributed across 10 or 10,000 servers.
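
A hedged sketch of the general idea in bash (not the actual Keep algorithm; the file name and server count are made up): derive the first server to probe deterministically from the content hash, so that writers and readers agree without any location metadata:

hash=$(md5sum segment.bin | cut -d' ' -f1)
nservers=8
first=$(( 0x${hash:0:8} % nservers ))
echo "probe server $first first, then scan the remaining $((nservers - 1))"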

However it does not work that well in another sense of scalable, where the number of servers is increased over time: the location of a segment is fixed by the calculation based on the number of servers when it was created, but it is searched for with the current number of servers.

For a similar situation the Ceph distributed object store uses a CRUSH map algorithm, which is a table-driven calculation where the table is recomputed when the number of servers changes, but in a way that preserves the location of segments created with a different number of servers.

This probably could be added to Arvados Keep too, but currently it is not particularly necessary, in part because when expansion is done it usually happens in large increments. However in extreme cases it can result in file opening times of hundreds of milliseconds, as many servers need to be probed to find the segments that make up the file. Arvados Keep is targeted at storage of pretty large files, typically of several gigabytes, and therefore it has a large segment size of 64MiB (Ceph uses 4MiB), which minimizes the number of segments per file, and also means that the cost of opening a file is amortized over a large file.

The other cost is that since there is no explicit tracking of whether a given segment is in use or not, a periodic garbage collection needs to be performed, but that is needed regardless as an integrity audit, for any type of filesystem design, and it is easy to parallelize.

Overall the design works well for the intended application area, and the fairly unusual and novel decision to use content addressing without any explicit metadata structure cross referencing files and storage segments provides (at some tolerable cost) metadata scalability.

160817 Wed: A subtle downside of filesystem checksums

Today on the Btrfs IRC channel a user who mentioned building his computer out of old parts asked about some corruption reported by Btrfs checksumming, which he wanted to recover from. Some of the more experienced people present then noticed that it was a classic bitflip (one-bit error), inside a filesystem metadata block, and added:

[17:55]        kdave | pipe: the key sequence is ordered, anything that looks 
                       odd is likely a bitflip
[17:55]     darkling | Look at that one and the ones above and below, and 
                       you'll usually see something that fairly obviously 
                       doesn't fit.
[17:55]        kdave | I even have a script that autmates that
[17:56]     darkling | Convert the numbers to hex and sit back in smug delight 
                       at the amazement of onlookers. ;)
[17:56]     darkling | There's even a btrfs check patch that fixes them when 
                       it can...
[17:56]        kdave | we've done the bitflip hunt too many times
[17:56]     darkling | It's my superpower.
[17:56]     darkling | (Well, that and working out usable space in my head 
                       given a disk configuration)

The point is that bitflips are rather more common in their experience as Btrfs users and developers than in the experience of most users; in the present case the corruption was likely due to unreliable memory chips.

Obviously bitflips are not caused by Btrfs, so their experience is down to Btrfs detecting bitflips with checksums, which makes apparent those bitflips that otherwise would not be noticed.

That detection of otherwise unnoticed bitflips brings with it the obvious advantage, but also a subtle disadvantage, which is the reason why it is often recommended to use ECC memory with ZFS, which also does checksumming. This seems counterintuitive: if filesystem software checksums can be used to detect bitflips, hardware checksums in the form of ECC may seem less necessary.

The subtle disadvantage is that as a rule checksums are not applied by Btrfs or ZFS to individual fields, but to whole blocks, and that on detecting a bitflip the whole block is marked as unreliable, and a block can contain a lot of fields: for example a block that contains internal metadata, such as a directory or a set of inodes.

Note: in a sense detecting a bitflip in a block via a checksum results in a form of damage amplification, as every bit in the block must be presumed damaged.

In many cases an undetected bitflip happens in parts of a block that are not used, or are not important, and therefore results in no damage or unimportant damage. For example as to filesystem metadata, a bitflip in a block pointer matters a lot, but one in a timestamp field matters a lot less.

ECC memory not only detects but corrects most memory bitflips, avoiding in many cases the writing to disk of blocks that don't match their checksums, or cases of checksum failure because of a memory bitflip after reading a block from disk. This prevents a number of cases of loss of whole blocks of filesystem internal metadata (as well as data).

This is obviously related to another Btrfs aspect, that by default its internal metadata blocks on disk are duplicated even on single disk filesystems. Obviously duplicating metadata on a single disk does not protect against failure of the disk, unlike a RAID1 arrangement (though it may protect against a localized failure of the disk medium).

But the real purpose of duplicating metadata blocks by default is to recover from detected bitflips: when checksum verification fails on a metadata block its duplicate can be used instead (if it checksums correctly, and usually it will). This minimizes the consequences of most cases of non-ECC memory bitflips.
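
For reference, a hedged sketch of how the metadata profile can be chosen or changed (device and mount point names are placeholders):

# create a filesystem with duplicated metadata (the single-disk default)
mkfs.btrfs -m dup -d single /dev/sdX1
# or convert the metadata of an existing filesystem to a single copy
btrfs balance start -mconvert=single /mnt/data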

Obviously duplicating metadata on the same disk is expensive in terms of additional IO when updating it. Therefore it might be desirable in some cases to have more frequent backups and ECC memory, and to turn off the default duplication of metadata blocks in Btrfs. So the summary is:

160806 Sat: Blu-Ray discs need deformatting before reformatting

I have been using dual-layer Blu-Ray discs (50GB nominal capacity) for offline backups, both write-once (BD-R DL) and read-write (BD-RE DL), and recently I wanted to reformat one of the latter (to remap some defective sectors), and found that dvd+rw-format cannot do it, as the Blu-Ray drive rejects the command with:

FORMAT UNIT failed with SK=5h/INVALID FIELD IN PARAMETER LIST

After some web searching I discovered that BD-RE discs must be deformatted with xorriso -blank deformat before a BD drive will accept a command to format them. In this they are slightly different in behaviour from both DVD±RW and CD-RW discs.
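
The full sequence might look roughly like this (a sketch, assuming the drive is /dev/sr0; option details may vary between drive models):

# deformat the BD-RE disc first
xorriso -outdev /dev/sr0 -blank deformat
# then reformat it
dvd+rw-format -force /dev/sr0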

160717 Sun: A typical cold-storage system and its cost

While looking for related material I chanced upon a message describing a typical system configured for local cold-storage using Ceph, instead of the remote third-party services discussed recently:

OSD Nodes:

6 x Dell PowerEdge R730XD & MD1400 Shelves

  • 2x Intel(R) Xeon(R) CPU E5-2650
  • 128GB RAM
  • 2x 600GB SAS (OS - RAID1)
  • 2x 200GB SSD (PERC H730)
  • 14x 6TB NL-SAS (PERC H730)
  • 12x 4TB NL-SAS (PERC H830 - MD1400)

The use of 4TB-6TB NL drives is typical of bulk-data cold-storage (or archival) systems with a very low expected degree of parallelism in the workload, mostly a few uploading threads and a few more threads to download to warm-storage (or to cold-storage if the system is used as archive-storage).

There are also a couple of enterprise (the capacity of 200GB suggests this) flash SSDs, very likely for Ceph journaling, and two 600GB SAS disks for the system software.

Such a system could be used for cold-storage or archival-storage, but to me it seems more suitable for archival, as it has a lot of drives and most of them are as big as 6TB. But it could be suitable for cold-storage as long as uploads and download concurrency is limited.

Note: The system configuration seems to be inspired by the throughput category on pages 13 and 16 of these guidelines.

The total raw data capacity is 120TB, and that would translate to a logical capacity of around 40TB using default 3-way Ceph replication for cold-storage, or 80TB with a 12+4 erasure code set (BackBlaze B2 uses 17+3 sets, but that seems a bit optimistic to me).

Note: as to speed, with 26 bulk-data disks each capable of perhaps 20-40MB/s of IO over 2-4 threads, and guessing wildly, aggregate rates might be: for cold-storage 200-400MB/s for pure writing over 3-6 processes and 300-800MB/s (depending on how many threads) for pure reading over 5-10 processes; for archival storage it might be half that for writing (because of likely read-modify-write using huge erasure code blocks and waiting for all of them to commit) over 2-4 processes, and 400-800MB/s for pure reading.

Note: To coarsely double check my guess of achievable rates I have quickly set up an MD RAID10 set of 6 mid-disk partitions, with replication of 3, on 1TB or 2TB contemporary SATA drives, and run fio doing randomish IO in 1MiB blocks:

soft#  fio blocks-randomish.fio >| /tmp/blocks-randrw-1Mi-24t.txt                                                                   
soft#  tail -13 /tmp/blocks-randrw-1Mi-24t.txt

Run status group 0 (all jobs):
   READ: io=1529.0MB, aggrb=51349KB/s, minb=1886KB/s, maxb=2418KB/s, mint=30191msec, maxt=30491msec
  WRITE: io=1540.0MB, aggrb=51718KB/s, minb=1920KB/s, maxb=2317KB/s, mint=30191msec, maxt=30491msec

Disk stats (read/write):
    md127: ios=24096/23557, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=4310/12770, aggrmerge=210/484, aggrticks=114720/152362, aggrin_queue=267136, aggrutil=94.43%
  sdb: ios=5315/12139, merge=570/1118, ticks=291976/493200, in_queue=785520, util=94.43%
  sdc: ios=4547/12434, merge=456/840, ticks=68920/48540, in_queue=117456, util=35.66%
  sdd: ios=2438/12447, merge=238/828, ticks=42272/38664, in_queue=80936, util=24.40%
  sde: ios=7084/13200, merge=0/40, ticks=153356/143492, in_queue=296840, util=54.44%
  sdf: ios=4252/13204, merge=0/38, ticks=85368/96032, in_queue=181396, util=41.64%
  sdg: ios=2227/13201, merge=0/41, ticks=46428/94248, in_queue=140672, util=33.84%
soft#  cat blocks-randomish.fio 
# vim:set ft=ini:

[global]

kb_base=1024

fallocate=keep
directory=/mnt/md40
filename=FIO-TEST
size=100G

ioengine=libaio
io_submit_mode=offload

runtime=30
iodepth=1
numjobs=24
thread
blocksize=1M
buffered=1
fsync=100

[rand-mixed]

rw=randrw
stonewall

Note: The setup is a bit biased to optimism as it is over a narrow 100GiB slice of the disk, the IO is entirely local and not over the network, and it is block IO within a file, therefore with no file creation/deletion metadata overhead. The outcome is a total of 100MiB/s, or around 17MiB/s per disk. With pure reading (that is, without writing the same data three times) the aggregate goes up to 160MiB/s, or a bit over 26MiB/s per disk, and with 16MiB blocks the aggregate transfer rate goes up to 170MB/s, or around 28MiB/s per disk. I have watched the output of iostat -dkzyx 1 while fio was running and it reports the same numbers. For another double check, CERN have seen over 50GB/s of usable aggregate transfer rate by 150 clients over 150 servers (each with 200TB of raw storage as 48×4TB disks), or 350MB/s per server.

As to comparative pricing, the same one-system capacities on Amazon's S3 for about 5 years have a base price currently of $30,000 for 40TB in S3 IA and of $33,600 for 80TB in Glacier, plus traffic and access charges; these are difficult to estimate, but I would add $6,500 to the S3 IA cost and $9,000 to the Glacier cost.

Note: $6,500 is around the price of accessing around 20TB (of the 40TB) per year from another region (within region accesses are free) and reading back 10TB per year from S3 IA; $9,000 is around the price of restoring 1TB of the 80TB per year in 2 different months over 4 hours each time (10 downloads of 500GB each at 125GB/hour).

Rounding up a bit (10%) to account for other charges that apply may give around $40,000 over 5 years for 40TB of S3 IA, or $47,000 over 5 years for 80TB of Glacier.

The total cost of ownership of the local system hardware plus a minimum of system administration would have to compete with that. The purchase price of one such similar system is around $14,000 with similar or better SSDs and disks, plus around $5,500 for a 12×4TB disk expansion unit (prices include 5 years of NBD support but not tax), for a total of around $20,000 (the pricing I looked at is for SATA disks, but NL-SAS disks are not much more expensive); to this one should add data center charges (for space, power and connectivity), and my rough estimate for that is around $12,000 over 5 years for colocation per system.

So the total cost of ownership for one of the cold-storage or archival servers mentioned at the beginning would be around $32,000 over 5 years, compared with $40,000 for 40TB of S3 IA, or $47,000 for 80TB of Glacier (both before tax). That $8,000 or $15,000 difference can pay for a lot of custom system administration, especially given that this is largely a set-up-and-forget server. Considering that many sites have rack space and networking already in place, the cash difference can be bigger.

The main advantage of Glacier of course is that it is guaranteed offsite, but then colocation can do that too, even if it is harder to organize.

Overall for a 5-year period my impression is that local or colocated cold-storage (or archive-storage) is less expensive than remote third-party storage, so for those who don't need a worldwide CDN that is still the way to go.

160706 Wed: First impressions of the ZOTAC CI323 mini-PC

I have been using the ZOTAC CI323 mini-PC for a couple of weeks. I bought it (from Deals4Geeks for around £150 including VAT, without memory and disk) to check a bit the state of the art of low power, small size servers; mini-PCs are in general based on laptop components, and they may be considered as laptops without keyboard and screen. I have previously argued that indeed laptops make good low power, small size servers and I have been curious to check out the alternative.

Since my term of comparison is laptops, which usually have few ports, I decided to get a mini-PC that has better ports than a laptop, to compensate for the lack of builtin screen and keyboard, and the CI323 comes with an excellent set of ports, notably:

Compared to many other mini-PCs it also has some other notable core features:

The PCI device list and the CPU profile are:

#  lspci
00:00.0 Host bridge: Intel Corporation Device 2280 (rev 21)
00:02.0 VGA compatible controller: Intel Corporation Device 22b1 (rev 21)
00:10.0 SD Host controller: Intel Corporation Device 2294 (rev 21)
00:13.0 SATA controller: Intel Corporation Device 22a3 (rev 21)
00:14.0 USB controller: Intel Corporation Device 22b5 (rev 21)
00:1a.0 Encryption controller: Intel Corporation Device 2298 (rev 21)
00:1b.0 Audio device: Intel Corporation Device 2284 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 22c8 (rev 21)
00:1c.1 PCI bridge: Intel Corporation Device 22ca (rev 21)
00:1c.2 PCI bridge: Intel Corporation Device 22cc (rev 21)
00:1c.3 PCI bridge: Intel Corporation Device 22ce (rev 21)
00:1f.0 ISA bridge: Intel Corporation Device 229c (rev 21)
00:1f.3 SMBus: Intel Corporation Device 2292 (rev 21)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
04:00.0 Network controller: Intel Corporation Wireless 3160 (rev 83)
#  lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 76
Stepping:              3
CPU MHz:               479.937
BogoMIPS:              3199.82
Virtualisation:        VT-x
L1d cache:             24K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-3
#  cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 480 MHz - 2.08 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 480 MHz and 2.08 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 480 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes

Given all this, how good is it? Well, so far it seems really pretty good. The things I like less are very few and almost irrelevant:

The things I particularly like about it are:

So I have decided, especially because of the very low power consumption, to use it as the shared-services server for my 2 home desktops and laptop. I was previously using for that one of the desktops, which is mainly for archive/backups, and so has four disks and draws a lot more power. I can now put that desktop mostly to sleep, and keep the CI323 always on.

So far I think it is a very good product, with a very nice balance of useful features and a good price, and as a low power server it is impressive.

Testers of this and similar models (1, 2, 3, 4, 5, 6) have looked at it as a desktop PC, or media PC, or even as a gaming console, and it seems pretty decent for all of those uses; the GPU is good enough for all but the most recent games, and the overall performance envelope seems both good and isotropic.

160701 Fri: Cold storage, archiving, Blu-Ray, large disks

A somewhat fascinating story revolves around cold-storage and archival storage. They are two storage types that are designed to be used in a rather atypical way:

Note: recent backups should go to cold-storage, not archival-storage (1, 2).

Decades ago hot-storage used to be magnetic drums, warm-storage magnetic disks, cold-storage and archival-storage were both on tape.

Currently hot-storage is on flash SSD or low-latency and low-capacity magnetic disk, warm-storage on smaller capacity (300GB-1TB) magnetic disk, cold-storage on larger capacity (2TB and larger) low IOPS-per-TB magnetic disks, and archival-storage on tape cartridges inside large automated cartridge shelves.

Cold-storage is meant to be read infrequently and only for staging to warm- or hot-storage, and archival-storage is meant to be almost never read back, to be write-only in most cases.

The typical cold-storage hardware is currently low IOPS-per-TB disks in the 4TB-6TB-8TB range, arranged in 2-3 way replication or, at the boundary between cold- and archival-storage, using erasure codes groups that have a very high cost for small writes or any reads when some members of the group are unavailable, but have better capacity utilization at the cost of only a small loss of resilience.

What has happened in the recent past is that cloud storage companies have been offering all types of storage, and in particular for cold and backup storage they have offered relatively low cost per TB used.

That is remarkable because usually cloud shared computing capacity is priced rather higher (1, 2, 3) than the cost of dedicated computing capacity, despite the myth being otherwise. The high cost of cloud computing is only worth it for those who require its special features, mostly that it effectively comes with a built-in CDN that otherwise would have to be built or rented from a CDN business.

The most notable examples are Amazon Glacier and BackBlaze B2 and they are very different.

They have rather dissimilar pricing for capacity: between $80 and $140 per year per TB for Glacier, and $60 per TB per year for B2.

The cost of S3 capacity is 10 times higher than Glacier; there are other services like rsync.net that cost similarly to S3, and Google Nearline that costs similarly to Glacier.

The main differences are that Glacier has very strange pricing for reading and a highly structured CDN, and Amazon keeps the details of how it is implemented a trade secret, while B2 has a simple pricing structure, a well known implementation, and a much simpler CDN. Starting with B2, the implementation is well documented (1, 2, 3, 4):

In other words, this is just a cost optimized version of a conventional cold-storage system. As to this BackBlaze have published several interesting insights, notably:

As to Glacier instead there is the mystery of the trade secret implementation: the Amazon warm-storage service costs 10 times as much, and it is known to be a cost optimized version of a conventional warm-storage system, based on ordinary lower capacity disk drives. Potentially some clue is in the strange pricing structure (1, 2, 3, 4) for reading back data from Glacier:

Therefore if one wants to read 100GiB out of 500GiB it is 10 times cheaper to request 10GiB every 4 hours than to request 100GiB at once, as the former incurs a read charge of 10GiB for the whole month, and the latter one of 100GiB for the whole month.

Note: there are other pricing details like a relatively high per-object fee.

This pricing structure suggests that Glacier relies on a two tier storage implementation, and the cost of reading is the cost of copying the data between the two tiers:

My guess is that the staging first tier is just the warm-storage Amazon S3. There has been much speculation (1, 2, 3, 4, etc.) as to what the bulk second tier is implemented with, and the two leading contenders are:

The Blu-Ray hypothesis has recently got a boost as an Amazon executive involved in the Glacier project has commented upon the official release of a Blu-Ray cold-storage system, Sony's Everspan, and he writes:

But, leveraging an existing disk design will not produce a winning product for archival storage. Using disk would require a much larger, slower, more power efficient and less expensive hardware platform. It really would need to be different from current generation disk drives.
Designing a new platform for the archive market just doesn’t feel comfortable for disk manufacturers and, as a consequence, although the hard drive industry could have easily won the archive market, they have left most of the market for other storage technologies.

This is quite interesting, because it seems a strong hint that Glacier does not use magnetic disks for its bulk tier; at the same time it is ironic because BackBlaze has created a competitive product (for those that don't need a built-in CDN with multi-region resilience) based on commodity ordinary cold-storage 4TB-6TB-8TB disks.