Software and hardware annotations 2008 June

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


080630 Mon Simple script for flexible SSH tunneling

Because of some complications with accessing files via OpenSSH through an SSH bastion I have developed a small script that makes that considerably easier:

#!/bin/sh

CMD="$1"; shift
PRX="$1"; shift

exec "$CMD" -o ProxyCommand="ssh $PRX nc %h %p 2>/dev/null" ${1+"$@"}

and that I have called sshx. This script can be invoked for example in these ways:

sshx ssh user1@bastion user2@target

sshx scp user1@bastion user2@target:path .

sshx sftp user1@bastion user2@target:path

using OpenSSH's own commands. Or it can be used as an SSH transport for RSYNC as:

rsync -e 'sshx ssh user1@bastion' user2@target:path/

or with LFTP as a transport for the FISH method:

set fish:connect-program sshx ssh user1@bastion
open fish://user2@target/path

The same effect can also be obtained by using SSH over SSH port forwarding, using this technique:

ssh -L 9022:target:22 user1@bastion
rsync -e 'ssh -p 9022' user2@localhost:path/

but it is less convenient, as the tunnel port must be chosen each time. There may be slightly less overhead though, as the forwarding is done directly by ssh instead of via nc.
This simple script has a weakness in that the options and arguments to the nested ssh for connecting to the bastion must be specified as a single argument which then gets expanded, with the usual issues for tokens containing spaces. This should not be a problem in almost any case that I can think of. It is also debatable whether there should be 2>/dev/null for nc: its only purpose is to suppress the Killed by signal 1. message printed at the end of a session when the connection ends, and it might also suppress some more useful messages. Also of course the above is only an outline of a proper script.
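To see how the single bastion argument gets expanded, here is a minimal dry-run sketch that prints the command line the script would exec instead of running it. The function name sshx_show and the -p 2222 option in the second call are only illustrative assumptions, not part of the script above.

```shell
#!/bin/sh
# sshx_show CMD "BASTION-ARGS" [ARGS...]
# Hypothetical helper: prints the command line that sshx would exec,
# without actually connecting anywhere.
sshx_show() {
    cmd=$1; shift
    prx=$1; shift
    # The bastion arguments are pasted as-is into the ProxyCommand string.
    printf '%s -o ProxyCommand="ssh %s nc %%h %%p 2>/dev/null"' "$cmd" "$prx"
    # The remaining arguments are passed through unchanged.
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

sshx_show ssh user1@bastion user2@target
# Options for the nested ssh must travel inside the one quoted argument:
sshx_show ssh '-p 2222 user1@bastion' user2@target
```

The second call shows the weakness mentioned above: everything meant for the nested ssh has to be quoted into one argument, so a token that itself contains spaces cannot be expressed.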

080630 Mon Aggregate filesystems, the GoogleFS and Lustre cases

Many popular file systems seem to suffer some scalability limits, due both to the (lack of) speed of fsck even in the best cases and to limitations in the underlying storage subsystems (as it is arguably unwise to have RAIDs of more than a dozen or two disks even with RAID10). In both cases current solutions seem to be based on the Internet architecture as that has proven to be scalable and highly resilient, at a price of course. The internet metaphor is to have coarse, mostly-read, distributed, coarsely parallel, network services arranged in a logical and physical hierarchy.
Two obvious (and most popular) examples of this tendency are Lustre and GoogleFS. Both are designed according to a similar plan:

Separate metadata and data service; following the inspiration of the UNIX file system design with its very different data and metadata areas.
Move the implementation of the data handling side to the client, so that all data management is done by the client once the metadata has been obtained.
Achieve resilience or scalability by distributing at least the data over many nodes.
Use conventional filesystems as their storage devices for data and metadata, and files within conventional filesystems as if they were blocks (mostly GoogleFS) or extents (mostly Lustre).
Be intrinsically network based, as the communication among their parts is by distributed message passing, whether or not the parts are on the same or (more often) on multiple nodes.
Target performance to large files accessed sequentially rather than small files or random access.
Be able to fsck storage areas in parallel.

There are major differences:

In Lustre the metadata is not distributed except as an optional (not that optional) warm backup so it scales less well than GoogleFS, with the advantage of much lower metadata latency.
Lustre can be run entirely on a single host, and that makes a fair bit of sense; it is not clear to me whether that is even possible with GoogleFS, and in any case it is clearly not what it is designed for.
Currently Lustre mostly supports independent storage areas and splitting a single file across multiple areas, while GoogleFS allows for both redundancy and splitting of all files across multiple nodes. The splitting of a file into portions in multiple storage areas both allows files to be larger than any single area, and for different portions to be accessed in parallel.
The Lustre implementations seem to care a lot more about the performance of individual storage areas than the GoogleFS one.
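The splitting of a single file across multiple storage areas can be illustrated with a minimal sketch of plain round-robin striping. This is an assumption made for illustration, not the actual layout algorithm of either Lustre or GoogleFS; the function name stripe_locate and the 1MiB unit size are hypothetical.

```shell
#!/bin/sh
# stripe_locate OFFSET UNIT_SIZE AREA_COUNT
# Prints "AREA OBJECT_OFFSET": which storage area holds the byte at OFFSET
# of the file, and where it sits inside that area's object, assuming the
# file is cut into UNIT_SIZE pieces dealt round-robin over AREA_COUNT areas.
stripe_locate() {
    offset=$1; size=$2; count=$3
    unit=$(( offset / size ))      # index of the stripe unit within the file
    area=$(( unit % count ))       # round-robin choice of storage area
    round=$(( unit / count ))      # full rounds completed before this unit
    echo "$area $(( round * size + offset % size ))"
}

stripe_locate 0 1048576 4          # first byte of the file
stripe_locate 5242880 1048576 4    # byte at 5MiB, with 1MiB units on 4 areas
```

With such a layout consecutive units live on different areas, which is what lets a file exceed the size of any single area and lets different portions be read in parallel.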

My overall impression is that Lustre is aimed at high performance (including high write performance) large storage systems (hundreds to thousands of disk drives) using somewhat high end components (like Infiniband based RAIDs) while GoogleFS is aimed at one or two orders of magnitude larger storage systems, using cheaper components (just PCs, individual disks, Ethernet). Also, their scalability follows the two different alternatives about scalability discussed previously: 50 times or indefinitely.
Which reflects the aims of their makers I guess: the HPC market for Lustre and very large search engine farms for GoogleFS.

080626 Thu Distribution of services to servers, some examples

Chatting (a year ago) with a knowledgeable system person, and more recently with another, it turned out that the preference was to allocate servers to services rather than to redundancy, with the usual argument that the outage of a server would only lose that service. That seemingly reasonable argument is based on somewhat fragile assumptions, as I have argued before in the abstract, but perhaps some more discussion and most importantly some examples make the point clearer, as I found in a conversation with another colleague. The two most fragile assumptions are that there is no tradeoff among:

Number of clients.
Number of services.
Server and service redundancy.

and that there are few dependencies among services. As to the tradeoff, the extremist solution is to have for each client multiple sets of servers each for a single service. So if one has K clients with S services and a desired degree of redundancy of N, one needs K×S×N servers.
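As a worked instance of the K×S×N count, here is a minimal sketch; the figures of 40 clients, 4 services and a redundancy of 2 are only assumed for illustration, and servers_needed is a hypothetical name.

```shell
#!/bin/sh
# servers_needed CLIENTS SERVICES REDUNDANCY
# The extremist allocation: one server per client, per service, per replica.
servers_needed() {
    echo $(( $1 * $2 * $3 ))
}

servers_needed 40 4 2    # prints 320
```

Even at this small scale the extremist solution needs hundreds of servers, which is why some tradeoff among the three factors is unavoidable.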
As to there being few dependencies among services, it may be the case as to the bare computer service, but there may be usage pattern dependencies. For example there may be no explicit dependency between SMTP and NFS service, but it may be meaningless to users to have SMTP available if they cannot write the message they are fully able to send because the file server does not make available the mailbox containing the message being replied to.
Computer management is about achieving results at a lower cost, or else we wouldn't be using computers, and the cost of buying and running too many servers can be significant. Given a goal of cost efficiency one has to trade off between how many clients each server supports, and how many services per server and how much redundancy is feasible (and the type of redundancy).
A simple example is wanting to provide 4 services, for example home directories via NFS, mailbox access via POP3, printing via CUPS and sending mail via SMTP, for 40 users. One can imagine various scenarios, but let's consider the case where the purchase and support and power and cooling budgets allow for 4 servers:

One server per service, each serving 40 users.: One server each for NFS, POP3, CUPS, SMTP, each covering all 40 users.
Two servers providing NFS and POP3 to all 40 users, two servers providing CUPS and SMTP one of the two being active and the other being a backup.: Two active servers, one with 2 services (e.g. NFS/CUPS) and the other with the other 2 services (e.g. POP3/SMTP) and another two identical servers as backups, covering all 40 users.
Two servers providing all four services, each to 20 users.: Two active servers, each with all 4 services, each covering 20 users, plus 2 identical servers acting as backups.
Four servers, each covering 10 users, each providing all four services.

Note that here there is some unrealistic simplification, as I am not mentioning the impact of lower level services, like DNS and authentication; we can assume that these are happily mirrored across all servers.
Anyhow, in theory there are no direct dependencies between the POP3 and NFS and SMTP services, and each can run by itself, but in practice users will not be accessing their mailboxes if their home directories are unavailable, and will not be sending many messages if they cannot access their mailboxes. Let's look at the consequences of outages in each case:

One server per service, each serving 40 users.: If the NFS server fails, the POP3 and SMTP servers continue to deliver, but none of the 40 users can work anyhow. If the POP3 server fails nobody can access mail, and if what the clients do is to process e-mail, having home directories is not that useful either, except to read old e-mail.
Two servers providing each 2 services, to all 40 users, one of the two being active and the other being a backup.: If the backup server fails, nobody notices except the administrator. If the main server fails all users are affected, but after a short interval for switching to the backup server, service resumes for all users.
Two servers providing both NFS and POP3, each to 20 users.: In this case an outage of the server for one half still leaves the other half of clients fully working without interruption, and it is often possible to arrange for the server for one half of the clients to start serving the other half in that case, even if with lower performance for everybody.
Four servers, each covering 10 users, each providing all 4 services.: With this each group of 10 users is fully independent of the others, but there is very little backup for each group unless they are redistributed to other servers.
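The outage cases above can be summarized by a rough count of how many users are hit when one active server fails. This is a minimal sketch that assumes users are split evenly into independent groups; the name users_hit is hypothetical.

```shell
#!/bin/sh
# users_hit USERS GROUPS
# Users affected when one active server fails, assuming USERS are split
# evenly into GROUPS independent groups, each served by its own server(s).
users_hit() {
    echo $(( $1 / $2 ))
}

echo "one server per service: $(users_hit 40 1) users lose that service"
echo "active plus backup:     $(users_hit 40 1) users wait for the switch"
echo "two halves:             $(users_hit 40 2) users affected"
echo "four quarters:          $(users_hit 40 4) users affected"
```

The count says nothing about how long the outage lasts, which is where the backup servers of the second and third layouts make the difference.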

In my experience the third alternative is the one that works best for most office and development environments, followed perhaps by the fourth alternative. Also because availability is not everything: maintainability is also very important, and the impact on that goes like this:

One NFS and one POP3 server, each serving 40 users.: Shutting down or restarting or just upgrading either server will require the agreement of all 40 clients, as it puts their work at risk.
Two servers providing both NFS and POP3, to all 40 users, one of the two being active and the other being a backup.: The backup server can be used as a test server, on which to attempt the maintenance operation, and if successful the active and backup servers can be quickly switched.
Two servers providing both NFS and POP3, each to 20 users.: Maintenance operations affect only half of the users, and that makes it much easier to obtain agreement.
Four servers, each covering 10 users, each providing all 4 services.: One needs to configure 4 servers, which are however broadly similar, and for maintenance one needs to take down a small group of users but four times, which can be spread over time a bit.

Now that is a very simplified example, but I have some other examples that are a bit more realistic. There are other considerations than availability and maintainability given a budget, for example configuration effort and susceptibility to human error or security issues rather than hardware fault, but for now my main focus is on availability and maintainability.
A good example is a development office with around 70 programmers, on two floors, with the following services to be provided: power, cooling, Ethernet, DHCP, NTP, DNS, NIS (authentication), FTP, IPP (printing), HTTP, HTTP proxy, POP3, SMTP, NFS, SMB, CVS, NNTP, Jabber (chat), Kickstart (installation), Amanda (backup). The 70 clients need all the services except possibly FTP, IPP, NNTP, Jabber, Kickstart and Amanda to function during the day. The budget allows for two servers. Half of the services on one server and half on the other? Which halves?
I did have this case and I set up all services on both servers but because of local constraints I could not put 30 clients on one server and 30 clients on another. I was able to share the two servers over all clients for some services, those provided via broadcast or master/slave relationships, like IPP, HTTP proxy etc. In case of failure of one server it was easy to switch around the IP addresses of something like CVS.swDev.Example.com and CVS2.swDev.Example.com, as it was not required to have automatic failover.
The whole took a few days to discuss with the client and two days to implement (two rather long days...), plus another day or two to document. In part also because it is fairly easy to replicate a configuration across very similar servers, which is another good point about having multiple redundant servers or multiple servers per small user group.
For larger organizations I would use the same structure extended with some backbone servers, separating services required per user from services that can be used by servers. For example time service, DNS, web caching and e-mail usually require some degree of centralized source or gateway, or have a single entry or exit point per organization (the Internet link). But things like file and print service belong very naturally with workgroups, as there is usually a very high degree of time and spatial correlation between users and local networks; and when this happens, accessing from one local network resources on another local network is usually more efficient than accessing a central resource from all local networks. The latter is only a definite win when most accesses come indifferently from any local network, which is not the case in many situations.
Decentralized servers providing a full set of services to small workgroups work well as a logical topology in many common cases. Within the same organization it is however possible to map this logical structure onto equally decentralized or centralized physical or administration structures. As to these I prefer the physical structure to be equally diffuse, but less strongly. A centralized physical structure (where workgroup servers are in the same location even if they serve distinct workgroups) creates an implicit dependency among all servers even if they are logically autonomous, as power and cooling are shared among them. This both creates a bigger risk of common mode failure and reduces the ability to perform maintenance on power and cooling, as the impact is then global instead of local.
As to administration, centralization instead makes sense because it is static, as it amounts to discrete periodic actions and does not impact the running of the services. Of course where configuration is supplied dynamically (e.g. DHCP) then it becomes a network service and should be decentralized; but then the DHCP configuration files can be defined and distributed statically from a central administration team. There are then scalability problems, but my impression is that the scalability problems of centralized administration are one or two orders of magnitude smaller than those for centralized physical location and centralized provision of services, simply because most of the time administration is not real-time.
However centralized administration still creates a common mode of failure, and having decentralized teams each with a full complement of skills (that is, not specialized by function) can greatly reduce the impact of any administration error or problem (for example when network administrators are in India, communication administrators in Canada, and there is a network problem in France). After all administration is also a service, just a less realtime one, and in addition every time that a boundary across human organizations is crossed there is a huge potential for delays and complications (for example the careers of the comms admins in Canada may be independent of that of the net admins in India and both independent of the users in France).
The overall principle is that the Internet is resilient and works well as a mostly hierarchical collection of small networks, each of them largely autonomous, and replicating this structure inside an organization tends to work well too, for those kinds of usage patterns (office work, development, ...) that tend to resemble Internet usage.

080625 Wed Toshiba does Cell based laptop graphics

Something rather improbable has happened: Toshiba have done a laptop with a Cell based graphics chip. This is rather remarkable on its own, but the appearance of a Cell style chip for graphics processing also supports a rumour that was published on some gaming site: that originally the Sony PS3 was to have two Cell chips, one of them for graphics, which was then replaced by a GPU from nVidia. This speculation is reinforced by the report that this Cell based graphics chip runs hot: the PS3's one Cell does too, and two may well have been too much.

080621 Sat DBMSes, automatic and simple tuning

While chatting with a smart person about web site performance issues, one came up with which we were both familiar: large web server system load caused by queries on a DBMS with no or bad table/column indices. This reminded me of some interesting papers (one of which I already mentioned) by the guy who designed one of the original relational DBMSes (Ingres for the PDP-11):

B-trees revisited: An interesting argument that since B-trees waste about 30% of space for insertion slack in each index page, static indices are more efficient as they maximize fanout per page. Insertions are then less efficient as they must be made in overflow pages at the end of the array, but these can be periodically merged with the main static index by rebuilding/compacting it.
Retrospection on a Database System: Here the same guys who pointed out that static indices maximize fanout report that they discovered that not only did people often not even create indices, but when created they were often not rebuilt/compacted to merge back the overflow pages with new insertions, destroying performance by far more than the relatively limited hit caused by having slack space within each index page instead of just at the end.
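The static index with overflow area described in these papers can be mimicked with a toy sketch using plain files: a sorted main file, an unsorted overflow file that takes the insertions, and a compaction step that merges the two. The idx_* names and the file layout are assumptions for illustration only, and this is not how INGRES stores its indices.

```shell
#!/bin/sh
# Toy static index: NAME.main is kept sorted, NAME.overflow takes new keys.
# All names and the layout are hypothetical, for illustration only.
idx_create()  { : > "$1.main"; : > "$1.overflow"; }

# Insertion is a cheap append to the overflow area, the main file is untouched.
idx_insert()  { printf '%s\n' "$2" >> "$1.overflow"; }

# Lookup has to scan both areas, so it slows down as the overflow grows.
idx_lookup()  { grep -qxF -e "$2" "$1.main" "$1.overflow"; }

# Number of keys still waiting in the overflow area.
idx_pending() { wc -l < "$1.overflow" | tr -d ' '; }

# Compaction merges the overflow back into the sorted main file.
idx_compact() {
    sort -u "$1.main" "$1.overflow" > "$1.tmp" &&
    mv "$1.tmp" "$1.main" &&
    : > "$1.overflow"
}
```

Nothing here ever calls idx_compact by itself, which is exactly the failure mode the second paper reports: the index keeps working, just more and more slowly, until somebody remembers to compact it.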

In the latter paper the conclusion has a wider applicability:

Users are not always able to make crucial performance decisions correctly. For example, the INGRES system catalogs are accessed very frequently and in a predictable way. There are clear instructions concerning how the system catalogs should be physically structured (they begin as heaps and should be hashed when their size becomes somewhat stable). Even so, some users fail to hash them appropriately.

Of course, the system continues to run; it just gets slower and slower. We have finally removed this particular decision from the user's domain entirely. It makes me a believer in automatic database design (e.g., [11]).

as in general expecting software administrators to understand complex performance models or even if understood to tune accordingly is just a bit optimistic, and things still work so it is difficult to argue that something needs to be fixed. As I have argued previously the good suggestion is to design software around a simple performance model or even better, as the quote above suggests, with self-tuning properties. This is not always easy, because of potential positive feedback loops, but hysteresis usually helps. Anyhow I reckon that either continuous (as in a B-tree) or occasional (as for example in Google's BigTable) automatic garbage collections (that is, tuning the storage layout) are a good approach in general, even if in general I dislike automatic actions by software. For most cases automatic works better than manual, simply because manual does not happen.
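How hysteresis can keep an automatic action from flapping can be shown with a minimal sketch of a compaction trigger with two thresholds. The 30% and 10% figures and the name next_state are assumptions for illustration, not taken from any particular DBMS.

```shell
#!/bin/sh
# next_state STATE OVERFLOW_PCT
# STATE is "idle" or "compacting"; prints the state to move to.
# Compaction starts only above the high mark and stops only below the low
# mark, so a value hovering between the two does not cause flapping.
next_state() {
    state=$1; pct=$2
    high=30    # assumed: start compacting at 30% of entries in overflow
    low=10     # assumed: stop compacting once back under 10%
    if [ "$state" = idle ] && [ "$pct" -ge "$high" ]; then
        echo compacting
    elif [ "$state" = compacting ] && [ "$pct" -le "$low" ]; then
        echo idle
    else
        echo "$state"
    fi
}

next_state idle 20          # prints idle
next_state idle 35          # prints compacting
next_state compacting 20    # prints compacting
next_state compacting 5     # prints idle
```

The same overflow level of 20% leads to different actions depending on the current state, and that gap between the two thresholds is what damps the feedback loop.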

080620 Fri The irony of Python and Lisp

While on a short holiday I had a chat with some other geek about languages, and about how Python is really a variant of Lisp. Well, after all it has the same basic data types (including the pair) and fundamental data structures, the same basic functions, and a very similar approach to object oriented features, using a property list approach to both classes and class instances (in this treading a path on which Vincennes Lisp and OakLisp had gone). The major ostensible difference is the syntax, but then Lisp 1.5 was only ever meant to be the internal representation of Lisp 2 which was supposed to have an Algol-like syntax.
Apart from the syntax the differences are all internal, and in particular two, which are very ironic:

The most popular Python implementation uses reference counting instead of garbage collection for automatic storage reclamation, which may be a bad idea.
Python implementations tend to be very slow, with some other arguably poor implementation choices.

As to Python performance, the irony is really huge, as Lisp had a reputation for many years, when computers were much slower and smaller than they are now, of being slow because of its advanced features. So Lisp implementors expended quite a bit of effort tuning their work, and there have been for many years compiled and interpreted Lisp implementations with really good efficiency. But ironically that no longer matters, and Python implementors can get away with really bad implementations.
Python programmers have done their bit, writing marvels of slowness like YUM (which has been slowly improved over the years). Few people seem to complain, in large part I suspect because most people tend to be rather more tolerant of slow GUIs than of slow command line ones. Sure there have been poorly written Lisp programs, in particular as the style of programming has to be a bit different in a language with automatic storage reclamation, especially if based on garbage collection, even if many Java programmers haven't been taking much notice of that.
Perhaps it is all down to the syntax: Lisp syntax still intimidates, and so Python gets away with slow implementations.

080618 Wed Europe says to upgrade to IPv6

The IPv4 address space shortage has reached the popular media with a call by the EU to upgrade European networks to IPv6. For a fairly old and rather uncontroversial technology the slow adoption may seem surprising, given the fairly obvious advantages it has. But there is here a classic conflict of interests between the fully vested and everybody else: most of those who are making (sometimes a lot of) money using IPv4 see no positive reason to switch, as they are already fully vested. Even worse, the scarcity of IPv4 addresses will act as a barrier to the growth of competitors, making their vesting in the current state of affairs even more valuable. Worse still, any cost they incur in supporting a switch to IPv6 will therefore mostly benefit their late-coming competitors.
The natural constituency for switching to IPv6 is made of those who would benefit from it: those with a stake more in the growth of the Internet than in its remaining as it is. That means some of the largest international sites like eBay, Amazon, Google, of course Microsoft (that is not particularly vested in the current edition of the Internet), late-coming countries and businesses, equipment manufacturers. And Europe: sure, some European countries (for example Norway) have been on the Internet since the beginning, but the current Internet is largely a USA dominion.
There is a more compelling argument, one that should persuade even those few in Europe that feel that they are vested in the current standards: the centre of gravity of the Internet will move from the USA to Asia, simply by sheer weight of numbers, and those numbers all but require that the Asian internet be IPv6 based. The choice for Europe is between being content to be just the minority partner in the old IPv4 based Internet already fully dominated by the USA, or to invest in becoming a minority partner in the new IPv6 internet to be dominated by Asia, but in which perhaps Europe has more of an opportunity.
As to me obviously I am not one of those who may fear that changing technology will weaken their IPv4-based career and vested interests (none to speak of in my case). IPv6 is far from revolutionary, but it is a way to fix a number of weaknesses in IPv4, and the longer the change is delayed the lower the chances to implement those fixes and reduce the need for various brittle workarounds.

080616 Mon Google and Amazon "cloud" suitability

So I have been recently talking about cloud computing from Amazon and Google, and the kind of businesses it can make possible, in the context of thinking about outsourcing computer services and their nature as a banking-like business, and I have been thinking about the underlying infrastructures needed. Well, I think that as to this starting from an e-commerce oriented infrastructure and culture like Amazon is better than starting from a search oriented infrastructure and culture, mostly for technical reasons.
The big technical reason is the prevalence or not of read-only access versus read-write transactions, and while most traffic on cloud services involves the former, they critically rely on the latter. The technical difficulty here is that it is much easier to scale for a search engine than for an e-commerce infrastructure, for two reasons:

The vast bulk of the effort in a search engine is not in the spider or indexer, it is in the searcher, which is almost entirely read-only, and even the spider and indexer find most of the material they deal with is mostly read-only.
Even the read-write portions of a search engine infrastructure can be best-effort only: failed updates are not a killer. If some page updates fail to enter the index, not a big deal. Even if some for-pay clicks fail to be logged, not a big deal. As long as the read-write (and read-only) side delivers high (not even very high) rates of success, nobody is going to be that upset.

In e-commerce sites it is almost exactly the opposite: transactions are what matters, and getting all those transactions right and completed matters a great deal. Sure, Google use locking internally, and they do transactions, but unlike Amazon they are not the core of the business (for most aspects that matter of Google's business).
However Google does have a side where transactions matter a very great deal: GMail (and I am ignoring the even more obvious Checkout as mostly not very Googly). Very much unlike the search engine part, GMail is based on transactions, it has high write rates (in particular because of floods of spam coming in), and must be reliable, because the only thing that people hate more than losing an order or a payment to an online catalogue site is to lose e-mail. Of course it is not just GMail: if one looks at the Google products list it is essentially divided in two sections: publishing products under Search and computing products under Communicate, show & share. The former are essentially non-transactional, the latter are essentially transactional, and are a much, much smaller part of Google's business, and probably an insignificant part of its revenues.
Conversely the Amazon product list is almost entirely about e-commerce, with some side forays into publishing, which support the e-commerce side.
One interesting difference is that Google's publishing services (advertising mostly, as Google is in effect a large advertising media agency even if it self-describes as a technology company) are wildly more profitable than those of Amazon, that is in effect a catalogue business like that of Sears-Roebuck, just online instead of by phone or mail. In this however it has developed a more transactional infrastructure and culture than Google.
It is likely that users of cloud computing ([1], [2]) services will develop e-commerce rather than publishing sites, because they are likely to be small, and it is much easier to set up a small shop (especially a small service shop) than a small publishing business. Well, sure blogs and narrow-interest sites like say AnandTech are in effect small publishing businesses, but their business model is to get a small cut of the advertising revenue that Google receive thanks to them, and that may seem a bit less independent than generating one's own revenue by selling one's own services, especially if those services are supplied by the much cheaper workforce of a third world country.
However on reflection third party transactional services have one interesting advantage: that they are highly localized. A search engine has a very large global state (the index) with very few updaters and a very large number of readers, an e-commerce site has a very large global state (the ledger) with very many updates and a small number of readers, but a third party e-commerce site has very many small local states, even if they are updated frequently.
In this it falls between a search engine and an e-commerce site, and looks more like GMail, but a GMail with greater reliability, or more precisely guarantees of reliability, because a cloud computing platform will be in effect used to implement many small banks. As to that I think that the competition is fairly open, and in this probably Google has a cultural advantage: Amazon do business mostly as principals, trading on their own account, while Google don't, as they piggyback on other people's publishing rather than doing their own (they surely do trade as principals as to advertising though).
But perhaps Google will go with its own grain and offer cloud computing services mostly oriented to publishing, and Amazon similarly mostly oriented to e-commerce, and they will exist side-by-side. Anyhow, it will be a rather interesting race, and some garage based startup as yet unknown may win it instead.

080615 Sun High quality realtime sw ray tracing

It is interesting to me that Intel have demonstrated realtime high quality sw ray tracing in a current generation rather popular game. The interest lies both in the demonstration of feasibility at a rather high quality level, and the colossal amount of hardware resources used, pretty large even for a contemporary system. But there is a profound irony therein: the additional cost for that extra CPU power is rather likely to be lower than that for a top end graphics card, as they often cost $500-$900.
The irony is even deeper as graphics card chips have come to be used as general purpose bulk vector processors, and the Intel person driving that project ascribed a large part of the feasibility of doing sw ray tracing to the use of SSE (and the related multiple ALUs in the AMD and Intel chips).
The frame rate achieved is not spectacular, but tolerable, and anyhow this demonstration is transparently just a teaser for some future Larrabee style product from Intel. The Intel strategy seems to have changed towards using the IA32 ISA everywhere, even in low power embedded processors, as they have realized that on chips with contemporary scale of integration the silicon (and realtime) cost of an IA32 ISA decoder stage is somewhat insignificant. Thus the fairly valiant attempt to push chips based on that ISA as graphics processors.

080608 Sun Separating computing workers from computing equipment

Web jobs are shifting to third world countries, and most large first world companies are rather determinedly pursuing the same shift: eliminating jobs as fast as they can in the first world, while at the same time building physical infrastructure, almost entirely automated, in the same first world. That's because the first world has a large advantage in infrastructure services like power and water, and also in physical security and law enforcement.

Then there will be vast masses of entirely interchangeable programming labor units in the third world earning little more than the cost of living there, as their negotiating power will be very very low: if they were to strike or so much as dare to ask for a raise or slow down work, their employer will just reroute the work to someone else.

The virtual-factory workers will not even have physical access to the factory that they will be working in, lockouts will take milliseconds, and the hiring of scabs only a little longer. This will put employers in physical spaces at a large disadvantage to employers in virtual spaces, as they will still have to deal with the annoyance of meeting employees in person and giving those employees physical access to their assets.

Which will mean a trend for all jobs that can be performed in a virtual space to move to third world countries, and thus a reversal of migration to the first world, in particular for IT workers.