Software and hardware annotations 2007 May
This document contains only my personal opinions and calls of
judgement, and where any comment is made as to the quality of
anybody's work, the comment is an opinion, in my judgement.
- 070528
A poor UNIX legacy: dotfiles in the home directory
- I feel that in recent and not so recent implementations of the
UNIX design, like Linux, many of the aspects that are simple,
work well and are worthwhile are those that have not been changed
from UNIX, while those that are complicated and work poorly have
been introduced by less gifted designers, often transparently
inspired by the Microsoft cultural hegemony.
There is however at least one aspect of the UNIX tradition that
I feel is quite annoying, and it is dotfiles,
and in particular user configuration files as dotfiles in home
directories.
The convention not to list files whose name begins with a
dot is entirely arbitrary, in particular because it is
implemented in user code. When just about the only program that
would list directories was ls
it might have been
tenable, but it was a wrong idea anyhow. The rationale was to
avoid listing the .
and ..
entries,
and was generalized to any files beginning with a dot. But the
more appropriate design would have been to decide that when
listing a directory the two special entries would never
be listed, even if present.
The other regrettable convention was to put configuration
files in the user home directories, which was facilitated by the
convention that by default dotfiles are not listed. Now I have
several hundred dotfile configuration files and directories, and
this has a couple of problems, the major one being that they are
easy to forget, precisely because they are not listed by
default. The minor one is that in many file systems directories
with hundreds of entries are slow, and a user's home directory
must be traversed in very many cases. The more appropriate
design would have been to reproduce the classic UNIX style
hierarchy inside a user home directory, and have a convention to
put all user configuration inside an etc
subdirectory
of the user's home directory, and not dotfiles. Too late now.
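As a quick illustration of the scale of the problem, a one-liner
with ordinary shell tools (just a sketch, nothing standard) counts
the hidden entries that have accumulated in a home directory:
# Count the dotfile and dot-directory entries in a home directory;
# 'ls -A' lists hidden entries but omits '.' and '..' themselves.
ls -A "$HOME" | grep -c '^\.'
On a long-lived account this easily reports the several hundred
entries mentioned above, exactly the clutter that an etc
subdirectory convention would have kept in one place.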
- 070527f
Thinking again about simplicity: laptop versus desktop
- Thinking again about
simplicity
the issue is often which simplicity to privilege. An example
comes from some discussion about what to buy, a laptop or a
desktop:
- laptop
- Well, a laptop is a self-contained unit
that is very simple to use. It has all you need, no cables
hanging around, to repair it you just send it to the
manufacturer, and it can be carried around in a bag. It is in
effect an appliance.
- desktop
- A desktop is simplicity itself as to
upgrading and repairing: you open it, plug in new parts, and
there is no need to send it to the manufacturer for repair, as
most breakages can be fixed quickly by replacing a component.
Then it all depends on circumstances: a business may well decide
they don't need expandability, and standardize on laptops and
keep a stock of spare laptops to obviate the repair delay; I
have a desktop because for me it is much simpler to upgrade and
repair it myself. The same for the eternal debate between
bridging and routing:
bridging makes some forms of configuration simpler, routing
makes fault diagnosis and isolation simpler (but I can't imagine
any way in which RAID5 makes things simpler :-)
).
- 070527e
The problems with Wiki documents
- When writing documents, in particular technical documentation, I
prefer to write them in HTML (using a validating, DTD-driven
editor like
XEmacs
with PSGML
or jEdit
with Xerces-J).
I have also been contributing occasionally to some Wikis, and I
much prefer using HTML. The one advantage of a Wiki is
shared editing, but this is paid for at a high cost:
- Wiki text can only be properly read by installing a web
server and a Wiki engine, while HTML can be read with any
browser.
- There is no easy way to check Wiki text syntax, and it
lacks some of the useful details of HTML.
- The most common forms of Wiki text have a really annoying
syntax, and
the many different Wiki engines
often accept
rather different markup conventions
(for example
1,
2,
3,
4,
5).
HTML has a more verbose but equally simple syntax, and one
which is well standardized, so there are very many tools
that can be used to process it.
- A directory with a number of HTML files is easy to
grep or process, a set of Wiki
texts not so easy (unless the Wiki is based on text files
instead of a DBMS backend).
I reckon that HTML is simple enough to be used directly, and it
has none of the issues above. Other people think
that Wiki syntax is less intimidating, and that's why it is
popular; but I think that Wikis are popular not because of a
simpler syntax (it is not simpler to me) but because they avoid
the need to set up and maintain a web server, as the Wiki owner
does that.
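As a small illustration of a couple of the points in the list
above, these are the kinds of checks that are trivial with a
directory of HTML files and much harder with Wiki text held in a
DBMS backend (the paths are made up):
# Full-text search across a documentation tree with ordinary tools:
grep -l 'RAID5' ~/doc/*.html
# Check one page for markup errors (xmllint reports parse problems):
xmllint --html --noout ~/doc/raid.html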
- 070527d
Selling advanced switching features and split trunking
- Recently I had a meeting with a salesperson from a major
reseller of network equipment, and the pitch was about advanced
features in the routing switches they sell, and the attendant
consulting services. This was a bit inappropriate because the
network in which these are used is rather functionally simple
(just an
internet
of 10-30 laboratories) but
has huge performance requirements.
Anyhow the major pitch was for a very advanced feature
called
split trunking
which promises an amazing advantage: to have two outgoing
routers with transparent load balancing and failover; somewhat
like a
cluster system
(such as Heartbeat,
RHCS)
but for the gateway service
instead of a
higher level service like HTTP. If that does not sound amazing,
note that in general a leaf node in IP uses only one router
address per route, e.g. 10.0.0.1
as the default outgoing route for nodes in
10.0.0.0/8. There are special
case exceptions, for example Linux has the
advanced routing
extensions, but in general
IP
is based on one active gateway per route. Transparent failover
of that one route can be done, but only by turning each node
into a router that listens for route updates, or pointing each
node to a local router that does.
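To make the contrast concrete, this is roughly what the two cases
look like on a Linux leaf node with the iproute2 tools (a sketch
only: the addresses are made up, and the multipath form needs the
advanced routing extensions mentioned above):
# The ordinary case: one gateway address for the default route.
ip route add default via 10.0.0.1
# With the advanced routing extensions a route can list several
# nexthops, which is as close as a plain node gets to gateway
# load balancing without listening for route updates:
ip route add default \
    nexthop via 10.0.0.1 weight 1 \
    nexthop via 10.0.0.2 weight 1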
But the split trunking provided by this product allows one to
have a fully bridged internal
LAN (with VLANs), and then to
have a couple of routers to the outside that provide transparent
load balancing and failover even to nodes that have no load
balancing routing extensions and do not listen for route
updates. This sounds absolutely fantastic: anybody with more
money than sense can look good by purchasing such a solution,
which automagically works without requiring any complications
like routing or advanced routing setup. Salesman's routing!
But to others it may sound impossible, and indeed it is if
it is attempted within the limits of conventional IP technology,
because it is based on the two routers having the same IP
address, and the same
Ethernet
address too for good measure. The whole trick is based on the
idea that any node sending its packets to the two gateways will
be attached to a switch with a learning table (Ethernet address
to port), that the same Ethernet address can be in two entries of
the table, and that the switch can forward Ethernet frames across
both entries with the same address. Then when one of the routers
with the same Ethernet
MAC address
fails, the entries pointing to it are unlearned (usually with an
explicit notification process), and traffic then flows only to
the surviving router, as hinted here (I have added some emphasis
to critical phrases):
Split Multi-Link Trunking (SMLT), a Nortel Networks
extension to IEEE 802.3ad, improves on the level of Layer
2 resiliency by providing nodal protection in addition to
link failure protection and flexible bandwidth scaling. It
achieves this by allowing edge switches using IEEE 802.3ad
to dual home to two SMLT aggregation switches. It is
important to note that SMLT is transparent to attached devices
supporting IEEE 802.3ad.
Because SMLT inherently avoids loops due to its superior
enhanced-link aggregation-control-protocol, when designing
networks using SMLT, it is not necessary to use the IEEE
802.1D/w Spanning Tree protocols. Instead, a method is used
that allows two aggregation switches to appear as a single
device to edge switches that are dual homed to the aggregation
switches. The aggregation switches are interconnected
using an InterSwitch Trunk (IST), over which they exchange
addressing and state information, permitting rapid fault
detection and forwarding path modification. Although
SMLT is primarily designed for Layer 2, it also provides
benefits for Layer 3 networks.
This requires violating the spirit if not the letter of the
designs of both Ethernet and IP, but if everything works it
gives transparent load balancing and failover without any
changes to client devices, and to the manufacturer it also
delivers the advantage of locking customers into its
proprietary software and protocol to a high degree (because all
the switches involved, not just the two routers, must support
the extensions), even if the protocol has been
proposed for standardization.
What is the problem with violating the spirit or the letter
of existing designs? Reliability, usually, which depends on
simplicity and extensive
practice. One of my arguments about large bridged networks is
that routing exists for some good reasons, and routing software
(in simple situations) has been extensively tested. As to this,
I have seen a smart, experienced network manager attempt a few
times to implement a split trunking configuration in the field,
and fail for reasons that have proven hard to diagnose; even
attempting it required having equipment all from the same
supplier, and specific software versions (as, despite the
proposed standardization, different software versions from that
same manufacturer are apparently not quite compatible). Even if
it apparently works in several cases and in the lab...
- 070527c
Wildcard domain names update and spam avoidance scheme
- While reading about some amusing issues with the evolution of
the DNS I
discovered that there is a somewhat recent
RFC 4592 on DNS wildcards
which clears up some points about them. I
am interested in those because my
spam
avoidance scheme uses wildcards. The main points about spam are
that:
- It is essentially free for the sender but costly for the
receiver.
- It can only be sent if the receiver's address is
valid.
This suggests that filtering (even
when effective)
is sort of silly, as by the time a spam message is received the
cost of receiving it has already been paid, and apparently
85% of the messages
passing through most mail servers are spam. So one should
prevent the sending of a spam message rather than throwing
it away after receiving it.
But there are only two ways to prevent the sending of a spam
message: the first is not giving spammers the address, and the
second is retracting it after giving it. In practice only the
second way works, because e-mail addresses leak easily into
public space. The solution is therefore to use disposable,
easily retracted addresses. But retracting an address means
inconveniencing welcome as well as unwelcome senders of e-mail.
The solution to that is to have a different e-mail address per
sender, or at least per category of senders, so that if an e-mail
address is invalidated because it has become a spam target only
one or a few welcome senders are inconvenienced.
The question is then how to get numerous disposable e-mail
addresses, and some people just use several addresses registered
with free e-mail providers. But these addresses remain valid for
a long time, and just become spam-sinks: they do not prevent the
sending of the spam, they just shift the cost of receiving it to
someone else. There are some web based services that do
forwarding via periodically expired addresses, but these still
have two defects: the expiration is usually not under the control
of the receiver, and e-mail gets read by the service as it gets
forwarded by it.
One solution is to use a different domain per correspondent,
for example tony.mail.example.org
(for
people) or flickr.mbox.example.org
(for sites and companies), and then delete it if it gets spammed
and reissue the welcome senders with a new one. This is easiest
if one can run a DNS and SMTP server. But there is a better
solution that relies on the notion that most of these e-mail
addresses would not get spam, and explicitly adding one per
correspondent is too tedious.
So my solution is to run a DNS server and use wildcard
DNS records for e-mail addresses too. Then I make up
correspondent specific
domain names under the wildcard for private correspondence, but
I explicitly define a serially numbered one for places like
USENET News and web sites where a reply address can and should
be expired periodically, using domains like
news08.example.com. It is also quite
important to make sure that the main domain name cannot be used
as a mail target, by ensuring it does not have an MX
RR (and ideally no A
RR), and that any servers with
a published domain name do not have an SMTP server on them. The
wildcard scheme also requires an SMTP server that can accept
e-mail for wildcard domain names (for example
exim).
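As a hypothetical sketch (not a tested configuration), with Exim
the wildcarded subdomains can simply be listed among the domains
the server accepts mail for, using Exim's domain-list wildcard
patterns:
# '@' stands for the local host name; a leading '*' matches any
# name ending with the rest of the pattern, so this accepts mail
# for every subdomain of mail.example.org and mbox.example.org.
domainlist local_domains = @ : *.mail.example.org : *.mbox.example.org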
Now, if all domain names under the examples above are valid as
they are covered by the wildcard, how can one retract the few
that eventually get spam? Well, the rule is that a more
specific definition overrides the wildcard one, so adding any
RR definition for a spammed name (for example a TXT RR) is
enough to retract it. Therefore an example DNS zone file (in the
usual BIND syntax)
may look in part like:
$ORIGIN example.org.
@ NS NS0
@ NS NS1
NS0 A 10.0.0.2
NS1 A 10.0.0.3
WWW A 10.0.0.2
SMTP A 10.0.0.3
POP3 A 10.0.0.3
; Wildcards
*.mail MX 1 SMTP
*.mbox MX 1 SMTP
; Number changed periodically
me08 MX 1 SMTP
; Invalidated
phat.mbox TXT "spammed"
tony.mail TXT "spammed"
anne.mail TXT "spammed"
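Assuming the zone above is served by one's own name server, the
effect can be checked with dig (a sketch; the answers follow from
the zone fragment and the RFC 4592 rules):
# Covered by the *.mail wildcard, so an MX RR is synthesized:
dig +short MX anybody.mail.example.org
# tony.mail has its own (TXT) RR, which shuts off the wildcard,
# so the MX query returns nothing and mail to it is refused:
dig +short MX tony.mail.example.org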
- 070527b
When bridging or RAID5 are appropriate
- While I have manifested my
dislike of bridged networks
and amusement about RAID5
there are some special cases where they are applicable, and
usually in the small.
As to bridged networks I had some interesting discussions
with a rather smart person, and while we respect each other's
arguments, we found our difference boiled down to how large a
LAN should be: my limit being 30-50 nodes and a single switch,
and his being 500-1000 nodes and 10-20 switches. I still think
that an internet should be based on IP routing and not on
Ethernet bridging, and that the latter does not scale, but
sometimes it is expedient to use bridging as a
port multiplier
for a single LAN, up to a low
threshold.
As to RAID5 (in general parity RAID), there are only three
cases where it makes some sense:
- With exactly 3 discs, where RAID10 cannot apply: there is
some redundancy, and the speed penalty of a stripe of just 2
data discs is not large. However a RAID10 over 4 discs is still
much, much better for only a little more money.
- In the quite rare cases where applications write in
aligned, stripe sized units, the written data does have large
speed or persistency requirements, and the stripe size is
small. Having applications do aligned, stripe sized writes is
not terribly easy, because one must know the alignment to start
with, which can be obscured by partitioning, virtual volumes
etc., and then arrange the batching of writes. Using something
like the
sunit
and swidth
parameters of the
XFS file system
is pretty essential (for both data and the log/journal); a
sketch of setting them follows at the end of this entry.
- Another smart guy from
BNL
persuaded me that in cases similar to his RAID5 is an
acceptable compromise, for what is essentially an online
cache of a much larger offline archive:
- The data gets written once and read many times.
- The data is already fully backed up.
- The main purpose of the small degree of redundancy of
RAID5 is to minimize service disruptions when a single
disk fails in one of the many subsets of the data,
mostly as a convenience, as continued operation at
regular speed is not essential.
It is indeed possible to get
somewhat reasonable performance
out of an undamaged RAID5 storage system, if those conditions
are met or one accepts significantly lower write performance
than for RAID10.
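As promised in the second case above, this is roughly how the XFS
alignment parameters might be set for a hypothetical 4-disc RAID5
with a 64KiB chunk size, that is 3 data discs (the device name is
made up):
# sunit = 64KiB / 512B = 128 sectors; swidth = 3 data discs x 128 = 384.
mkfs.xfs -d sunit=128,swidth=384 /dev/md0
XFS also accepts sunit and swidth as mount options, which helps
if the array geometry changes later.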
- 070527
RAID5 and bridging/VLANs are not free lunches
- While discussing the popularity of RAID5 (and other types of
parity RAID) and of bridging/VLANs with some friends, one of them
who worked for a popular RAID product supplier said that he
thought that RAID5 is
salesman's RAID;
by this he meant
that RAID5 is the ideal sales pitch, as it promises both
redundancy and low cost, and of course excellent performance due
to inevitable proprietary enhancements. Similarly for VLANs,
which are presented as the ideal way to expand an installation
limitlessly and transparently in the most flexible way.
The pitches for RAID5 and VLANs are based on the premise
that difficult choices about difficult tradeoffs are not
necessary, and that anybody with more money than sense can
purchase a ready-made solution
that does not
require the effort to study issues like performance vs.
redundancy vs. cost or complicated stuff like routing.
- 070520b
Non uniform LCD monitor colour temperature
- Having recently purchased a second
LCD
monitor for my PC, an
LG L1752S
to complement my existing
LG L1710B,
I have realized, by comparing them and trying to set them up to
look visually similar, that in both the colour temperature
depends rather noticeably (once noticed) on the vertical
angle of vision.
Quite frustratingly the two have opposite gradients: the
L1752S looks bluer at the top and yellower at the bottom, and
the L1710B looks yellower at the top and bluer at the bottom.
Also the older one (L1710B) tends to warmer colours than the
other; this may be because of an older backlight, or perhaps
because the newer one may have an
LED backlight
(anyhow the newer monitor's backlight is also noticeably
brighter). Also, the colour temperature gradient on the L1752S is
rather more pronounced than on the L1710B.
Another thing that I have noticed is that, in the on-screen
menu for setting a custom colour temperature, settings
significantly different from the middle value (50) cause obvious
colour distortions; that is, the visual effect is non-linear and
strongly different for the three colours.
- 070520
Another RAID perversity
- From a reader of my notes, an amusing (not for him I guess)
tale of RAID misconfiguration:
[ ... ] an NFS server that has two 12 disk arrays
attached to it. The drives are 250GB SATA devices, so there is
a total of about 3TB of space in each array, 6TB raw space. On
boot, the devices and capacities are detected by the kernel,
which displays these two lines:
SCSI device sdb: 3866554368 512-byte hdwr sectors (1979676 MB)
SCSI device sdc: 3866554368 512-byte hdwr sectors (1979676 MB)
I start wondering where one third of the array went, and start
digging. The former admin mentioned that they were configured
as "Two RAID 5+0 array[s ...] with a usable storage
capacity of 3.6TB. Filesystems are striped across both arrays
using LVM2." Confused, I start running some calculations to
figure out how he managed to convert 6TB of space into 3.6TB
using RAID5.
It appears that there are actually four RAID5
arrays (two per physical array), each with one hot spare.
Within the RAID hardware, the two arrays are striped
(RAID5+0). The numbers mostly work for this configuration:
(12 drives - 2 for spares - 2 for RAID5) * 250GB = 2TB per array
LVM2 is then used to stripe the two RAID5+0 arrays (does that
make it RAID5+0+0?)
At least
LVM2
is used in striping mode instead of linear mode; by contrast I
have noticed that using LVM2 merely to span a logical volume
linearly across two physical discs is fairly popular.
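For comparison, this is roughly what the two allocation policies
look like with the LVM2 command line tools (a sketch with made-up
device and volume names, not the reader's actual configuration):
pvcreate /dev/sdb1 /dev/sdc1
vgcreate vg0 /dev/sdb1 /dev/sdc1
# Striped across both physical volumes with a 64KiB stripe size:
lvcreate -n data -i 2 -I 64 -L 3600G vg0
# The linear default just appends one physical volume after the other:
# lvcreate -n data -L 3600G vg0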
- 070513b
The missing links
- One inevitable disappointment with the
WWW is that on many
sites the text in many pages has no links (or very few), for
example most of the articles linked-to in my
previous notes. For commercial sites this
is easy to understand: they believe that revenue, mostly
obtained through advertising, depends on
stickiness
and thus any links reduce revenues. This is not as relevant for
non commercial web sites which often index their texts with
links, notably at Wikipedia. Another
problem is felt at many sites: laziness, which leads to treating
hypertext pages as if they were mere text pages (which results
in many commercial web sites not even having links between pages
in the same site, except for boilerplate navigation). But the
intentional lack of links when it happens is based on
the idea that the stickiest web site has a lot of links pointing
to it, but none from it to any other site (unless that site pays
or equivalent, like
link exchange).
Now look at it from a distance: if everybody does that, and
all sites are dead ends, the WWW ceases to exist, because it
becomes a collection of self contained collections of text
rather than hypertext. Then the only connective linkage between
sites becomes
search engines
and
directories,
as to link one simply copies some portion of the text and then
pastes it into a search engine form; indeed this is happening
and some users not only have search engine forms as their home
page, but even type
URLs
in those forms instead of in the browser's address field. So
more power to text search engines and directories, as the WWW
devolves from hypertext back to text. But the most important of
those is Google, and Google itself has
important financial impacts on sites, and again incoming links
are rewarded and outgoing ones are penalized, as
that directly relates to advertising and sales revenue. However
there is a catch here for
Google:
its text indexing scheme is based on links as well as on the text
that contains them. So Google, as well as advertisers, are in
effect giving significant financial incentives to sites to
undermine not just the link based nature of the WWW, but also
that of one of the most effective WWW indexing schemes, the one
that Google uses. Thus not just the avoidance of links within
supposed hypertext, especially those outgoing to other sites, but
also the establishment of large fake sites whose only purpose is
to provide large numbers of incoming links to those sites which
themselves have as few outgoing links as possible. It will be
interesting to see how this story evolves.
- 070513
PS3 very suitable for Folding@Home
- Interesting and unsurprising report that for some
embarrassingly parallel
applications like
Folding@Home
the PS3 delivers very high performance
running a fairly nice
background application.
That is not surprising, as each of the 6-7 secondary processing
elements (the vector processing elements) in a Cell Broadband
Engine architecture
(1,
2,
3,
4)
can deliver a lot of power, running at about 3GHz each, if the
problem data set fits in its 256KiB of local memory, which is
not too small.
- 070506c
On how to keep a job as a game programmer
- Entertaining advice from an article on how to keep a job as a
programmer, in the game industry:
"I know it's hard to keep focused on your current projects
and at same time watch the industry trends," Mencher says, "but
you must do this and ensure you're gaining the next generation
coding skills needed for the next killer job.
For programmers, it's never safe to stay focused on only
one technology or one platform. It is safe to keep learning
and increasing your skill base even if this means doing it off
work hours."
Well, I don't quite agree here: the single most important skill
for keeping a job as a programmer, not just in the games
industry, is to live in a low cost country, from Eastern Europe
to India or the Philippines. Those programmers without such a
skill will remain employed in the same way that naval engineers
remained employed in Glasgow. Anyhow the idea that game
employers discard those without the latest skills to hire those
who have them is just a symptom that there is a large oversupply
of programmers with recent skills.
- 070506b
Virtual machines as failures of existing kernels
- There have been a number of recent developments in virtual
machine software for x86 style systems; in itself nothing new, I
was managing a
VM/370
installation on an
IBM 3083
over a decade ago, and the IBM 3083 was nowhere near as powerful as a
contemporary desktop PC. But there is a common thread here:
virtual machines are being used now, as then, as a workaround,
because the underlying kernels or operating systems are too
limited or too buggy. Many years ago it was to work around the
limitations of
OS/360
and its successors, in one of the SVS, MVT, MVS incarnations:
SVS had a single address space, so you would run multiple SVS
instances in different virtual machines, MVT did not have
virtual memory, so you would run even a single instance in a
virtual machine to get around that, and MVS had poor support for
production and development environments, so one would run the
two environments in separate virtual machines. The same is
happening now, largely because of the same reasons; the one
inevitable reason to run virtual machines is the obvious one, to
run completely different operating systems, but let's look at
the other more common reasons for doing so:
- To consolidate multiple servers
- This is by far the most common rationale. Well,
server consolidation is an improper goal: why run
multiple virtual servers on a single real machine? What
makes more sense is service consolidation, and a
single server can well be used to consolidate multiple
services. The problem with this is that existing operating
systems make it awkward to merge configurations, to support
multiple ABI versions, etc., so that it is
sometimes easier to consolidate by just creating snapshot
images of existing servers and then running them in
parallel.
- To provide isolation between different services
- This is a safety and security issue, and it is the saddest
reason for using virtual machines, because service isolation
and containment are in effect the whole rationale for
operating system kernels. That it may be of benefit (and I
agree that in practice it might be) to add a second layer of
isolation and containment beneath the kernel is sad.
Admittedly this is already done in high security sites by
running separate hardware machines (on separate
networks in separate physical locations), but here we are
talking about ordinary sites whose service isolation and
containment needs should be satisfiable with a single
kernel.
- To run different execution environments
- This is another sad one. The idea here is that one might
want to run applications relying on different ABI levels.
But hey, isn't that something that can be solved easily by
installing different ABI levels concurrently and then using
things like
paths
to select the right one
for each service application (as in the sketch after this
list)? Well, if only it were that easy. There are famous
distributions with package and dependency managers that make
even just installing different versions of the same library
awkward. It can be easier to just create a wholly distinct
system image and instance just to work around such
limitations.
- To run different operational environments
- In this case one wants to have different environments, for
example for production, testing and development. There is
some point in running multiple system instances in virtual
machines here. The reason is that switchover between testing
and production, for example, depends on verification, a sharp
changeover and stability, and just switching images is indeed
the easiest way to do that, not least because tools to verify
the stability of an environment are difficult to write.
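As a purely hypothetical sketch of the path-based selection
mentioned above (the prefix and the application name are made
up), an older ABI level unpacked under its own prefix can be
selected per application with nothing more than environment
variables:
# Run one legacy application against the libraries and binaries
# installed under a separate prefix, leaving everything else on
# the system's current ABI level.
LD_LIBRARY_PATH=/opt/abi-2.3/lib \
PATH=/opt/abi-2.3/bin:$PATH \
  /opt/abi-2.3/bin/legacy-app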
Except for running wholly different operating systems sharing
the same hardware, I prefer, if at all necessary, weaker
virtualizations, like for example my favourite
Linux VServer,
which arguably should be part of the standard Linux kernel, as
it provides resource abstraction functionality that should
belong in every operating system. And never mind ancient but
lost-in-the-mists-of-time concepts like
transparent interposition.
- 070506
One of the first towns to get IPv6
- From BusinessWeek, an article about
a USA town switching to IPv6.
One can only hope that this is the start of a trend, as the
limitations of IPv4 are really starting to matter. Even if I
suspect that
NAT will still be used with IPv6,
it will then be only by choice, not because there aren't enough
addresses.
- 070505b
A recent RAID host adapter does RAID3
- While looking for an eSATA host adapter I was somewhat startled
to find that
XFX,
a company usually known for their graphics cards, has launched
the
Revo,
a line of SATA RAID host adapters which support
RAID3,
with models with 3 or 5 ports (reviewed for example
here).
It is not clear to me if it is really RAID3 or (more likely)
RAID4, that is whether striping is by byte(s) or by block. The
XFX product is based on a
chip by Netcell.
Fascinating though, as RAID3 has become less popular over time,
even if it makes
occasional unexplained reappearances.
- 070505
Copying a 500GB disk over eSATA
- Numbers from copying a whole 500GB disk from an internal SATA
drive to an external eSATA drive:
# ionice -c3 dd bs=16k if=/dev/sda of=/dev/sdb
30524161+1 records in
30524161+1 records out
500107862016 bytes (500 GB) copied, 10596.3 seconds, 47.2 MB/s
Cheap backup at 47MB/s (average over outer and inner tracks)
can't be much wrong, especially compared to 12-15MB/s for
Firewire 400 and 20-25MB/s for USB2. The only comparable
alternative is
Firewire 800
but my impression is that external cases and host adapters for
it are far more expensive than for eSATA.