This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
• NEC won tender 4.5PB over three deliveries
and from other notes on the same event:
• 500 TB already installed (450 x 400GB 7k SATA) connected to 28 storage controllers RAID6
- 500TB already installed -- 400GB 7k SATA drives and 50TB FC -- 140GB 15k
about the same presentation and I was quite amused: 450 drives (actually 400+140) over 28 RAID6s means over 16 per RAID6, which to me seems quite perverse.
- RAID 6 SATA
- 2 servers at 2 gbit every 20TB
- Jan 2007 -- 300GB FC and 500GB SATA
• 2MW today in near future .... 5.5 MW total into the building by end of year by 2017 predict 64MW
:-).
orthogonal base for configuration space via environment variables, of which I use three:
• HULL is for names of systems, which imply a certain hardware configuration, including for example which filesystems are available.
• SITE is for names of locations, which usually imply different networking setups for servers.
• NODE is for names of individual configurations, and is a recent addition, as I realized that several configurations may be needed on the same HULL at the same SITE, for example because of dual booting different GNU/Linux distributions, or virtual machines, or different roles.
case statements with pattern matching,
	something that the authors of most shell scripts I see seem to
	eschew in their quest for the most appalling style, as follows
	for example (many variants are possible, but a delimiter like
	the + below is necessary):
case "$SITE" in
'home') : 'Whatever purely site specific';;
...
esac
case "$SITE+$HULL" in
*'+laptop1') : 'Whatever purely hull specific';;
...
esac
case "$SITE+$HULL+$NODE" in
'home+laptop1+linux1') : 'Whatever for this specific situation';;
*'+laptop1+linux1') : 'Whatever hull and node specific';;
...
esac
For configuration files that are not shell scripts I use other variants on the idea:
/root/CONF/work+laptop1 might contain configuration files specific to that site and hull. Then switching configuration can be as simple as
cp -alf /root/CONF/work/. /.
cp -alf /root/CONF/work+laptop1/. /.
(even if I actually use a slightly different scheme and script).
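A minimal sketch of the overlay idea, applied from least to most specific. The function name apply_overlays and its two arguments are hypothetical, the site+hull+node directory level is an assumed extension of the naming above, and a plain recursive copy stands in for the hard-linking cp -alf:

```shell
#!/bin/sh
# Sketch only: copy configuration overlays over a destination tree,
# least specific first, so that more specific files win.
# apply_overlays and its arguments are hypothetical names.
apply_overlays() {
  conf_root=$1
  dest=$2
  # Refuse to run without a site name, else "$conf_root/" itself would match.
  test -n "$SITE" || return 1
  for name in "$SITE" "$SITE+$HULL" "$SITE+$HULL+$NODE"
  do
    # Skip overlays that do not exist for this combination.
    test -d "$conf_root/$name" || continue
    # Plain recursive copy; 'cp -alf' as in the text would hard-link instead.
    cp -Rpf "$conf_root/$name/." "$dest/." || return 1
  done
}
```

With SITE=work and HULL=laptop1 set, something like apply_overlays /root/CONF / would copy the work overlay first and then work+laptop1 on top of it.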
make file to copy or generate on
	    the specialized configuration file, using the name of site,
	    node or hull in the make variables and the
	    filenames. For example:
${HOME}/.emacs: emacs-${SITE}.el; cp -p emacs-${SITE}.el '$@'
	    and sometimes I preprocess the files to be installed with
	    cpp which requires some slight trickery instead
	    of a mere cp:
${HOME}/.Xresources: Xresources
	cpp-dot < Xresources > .tmp && cp -p .tmp '$@'
	    where .tmp is used to prevent overwriting the
	    older target in case the preprocessing fails, and
	    cpp-dot looks like:
ENV=''
CPP='gcc -E -x c'
test -f /lib/cpp		&& CPP=/lib/cpp
test -f /usr/lib/cpp		&& CPP=/usr/lib/cpp
test -f /usr/ccs/lib/cpp	&& CPP=/usr/ccs/lib/cpp
test -f "$LOCAL/bin/cpp"	&& CPP="$LOCAL/bin/cpp"
for VAR in HOME SITE HULL NODE
do
  eval VAL=\"\$"$VAR"\"
  ENV="$ENV -DEnv$VAR=$VAL"
done
exec $CPP $ENV ${1+"$@"} \
  | exec egrep -v '^[ 	]*$|^[#!]' \
  | exec sed 's/^  *//;s/ *\%\% *//g;s/\^^/"/g'
	    In the above there is a special trick: the sequence
	    %% is used where white space around it should
	    be deleted, as some variants of cpp insert
	    white space around expanded macros.
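The effect of the %% marker can be seen in isolation with a small sketch. The function name strip_glue and the sample line are made up for illustration; the sed expression is a simplified form of the one in cpp-dot:

```shell
#!/bin/sh
# Sketch: delete '%%' together with any spaces around it, as the last
# stage of cpp-dot does. The name strip_glue is hypothetical.
strip_glue() {
  sed 's/ *%% *//g'
}

# Simulated cpp output where spaces were inserted around an expanded macro.
printf '%s\n' 'XTerm*font: fixed-  %% 12 %%  -bold' | strip_glue
# prints: XTerm*font: fixed-12-bold
```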
	    /etc/env-NODE, /etc/env-SITE,
	    /etc/env-HULL, which typically read like this:
#!/bin/sh
export SITE
: ${SITE:='home'}
	    and a generic one that evaluates those and can override them
	    by injecting assignments from the kernel command line into
	    the environment like this:
#!/bin/sh
for S in /etc/env-SITE /etc/env-HULL /etc/env-NODE
do test -r "$S" && . "$S"
done
if test -r /proc/cmdline
then
  # This should be: tr ' ' '\012' | while read P
  # but cannot be because then the 'while' is a subshell.
  for P in `cat /proc/cmdline`
  do
    if N="`expr \"$P\" : '\([A-Z_][A-Z_0-9]*\)='`"
    then
      export "$N"
      V="`expr \"$P\" : \"$N=\(.*\)\"`"
      eval "$N"="'$V'"
    fi
  done
fi
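The command line parsing in that loop can be tried without rebooting by feeding it a string. The following is a sketch under assumptions: the function name import_cmdline and its argument are hypothetical, and it uses parameter expansion instead of expr, which also sidesteps values containing single quotes:

```shell
#!/bin/sh
# Sketch: export every NAME=value word of a kernel-style command line
# whose NAME is made of upper case letters, digits and '_'.
# import_cmdline is a hypothetical name; the script in the text reads
# /proc/cmdline directly.
import_cmdline() {
  # $1 is unquoted on purpose so that it splits on white space
  # (it is also subject to globbing, as with the backticks in the text).
  for P in $1
  do
    case "$P" in
      [A-Z_]*=*)
        N=${P%%=*}
        V=${P#*=}
        # Reject names with any character outside the allowed set.
        case "$N" in *[!A-Z_0-9]*) continue;; esac
        eval "$N=\$V"
        export "$N"
        ;;
    esac
  done
}

import_cmdline 'root=/dev/sda1 quiet SITE=work HULL=laptop1'
echo "$SITE+$HULL"
# prints: work+laptop1
```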
	    Then the generic script above is sourced at the beginning of the global profile script for users, or the rc scripts for init.
hdparm -T, and system CPU usage was
	around 11% for FireWire 400 and a still fairly low 28% for
	FireWire 800. Compare with the same figures for
	my USB2 test
	with much faster systems at lower data rates.
	non-Fibre Channel for small-to-middle situations, which can add considerable flexibility to configurations.
# dd bs=32k if=/dev/hda of=/dev/sdi
777871+0 records in
777871+0 records out
25489276928 bytes (25 GB) copied, 903.105 seconds, 28.2 MB/s
and vmstat 10 was reporting:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 1  3   4240  10208 676080  59140    0    0 25138 27612 2727 5682  2 19  0 78  0
 0  8   4240   9944 679664  58456    0    0 29947 27198 2960 6169  2 21  0 77  0
 2  2   4240   9608 670924  61236    0    0 25709 26650 2761 5985  8 19  0 73  0
 1  5   4240   9396 682068  60600    0    0 30333 28282 2928 6029  7 24  0 69  0
 0  5   4240   9800 686468  58512    0    0 27962 28345 2782 5709  2 20  0 78  0
which is pretty decent, even if I have seen people claiming to reach 35MB/s with other ATA-USB2 chipsets (I haven't checked yet which one is in this enclosure). With both reading and writing going on, CPU use is around 20% on a 3GHz Athlon 64. Just reading from the same external drive gives much the same rate with a 10% use of CPU:
# dd bs=32k if=/dev/sdi of=/dev/null count=100000
100000+0 records in
100000+0 records out
3276800000 bytes (3.3 GB) copied, 117.09 seconds, 28.0 MB/s
with vmstat 10 reporting:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1   4848   9952 574724  61264    0    0 27789     1 2318 4529  1 10  0 88  0
 1  1   4848   9652 575004  61240    0    0 27202     0 2273 4442  1 12  0 87  0
 1  1   4848  10320 574240  61260    0    0 27750     1 2363 4563  1  9  0 90  0
 0  1   4848   9532 575164  61208    0    0 27251     0 2275 4438  1 12  0 87  0
 2  1   4848   9892 576376  59660    0    0 27789     0 2316 4525  1  9  0 90  0
 0  1   4848  10132 576192  59588    0    0 27265     0 2274 4441  1 12  0 87  0
and neither transfer rate is disappointing. Overall a lot better than many others.
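The rates printed by dd can be rechecked from the byte counts and elapsed times. This is only a sketch: the function name rate_mb_s is hypothetical, and it assumes dd counts 1 MB as 10^6 bytes, which is consistent with the figures above:

```shell
#!/bin/sh
# Sketch: recompute the rate dd prints, in MB/s with 1 MB = 10^6 bytes.
# rate_mb_s is a hypothetical name; arguments are bytes and seconds.
rate_mb_s() {
  awk -v bytes="$1" -v secs="$2" \
    'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

rate_mb_s 25489276928 903.105   # copy from /dev/hda to /dev/sdi, prints 28.2
rate_mb_s 3276800000 117.09     # read-only test, prints 28.0
```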
bus because of that.
Our survey of technical applications indicated that typical HPTC programs spend the majority of their time waiting for memory. Ratios in the range of 5-80 floating-point operations per cache miss to main memory were typical.
which is very well supported by my experiences (and Ben's, hi! :->) in optimizing game code on various consoles, where more cache means better performance, given that most programmers, even game programmers, don't understand memory-friendly or microparallel algorithms.
By scaling to N communicating processes, we are able to spread the data movement task over N independent memory access streams. Scaling is, of course, limited by the cost of communication.
This is a bit too vague: the cost of communication is ambiguous as communication is rated in both bandwidth and latency, which can have very different cost profiles. For massively parallel algorithms latency probably matters more, but here it is not clear whether the cost of bandwidth or the cost of latency has been targeted.
Our hardware design was guided by a simple idea: while traditional clusters are built upon processor designs that emphasize calculation speed, the SiCortex cluster architecture aims to balance the components of arithmetic, memory, and communications in a way that delivers maximum performance per dollar, watt, and square foot.
This balancing is surely wise, and reminds me of a very good point by Edsger Dijkstra about optimal page replacement algorithms (to be discussed some other time), where optimum utilization of one resource is not the goal, but cheapest utilization of all resources (that is, cost-weighting). Unfortunately shallow customers (which exist in the HPC market too) buy on raw performance benchmarks, and selling to the wise and discriminating restricts the potential market.
Our obsessive attention to low power resulted in a variety of performance and cost benefits. By holding down the heat generated by a node, we were able to put many nodes in a small volume. With nodes close together, we could build interconnect links that use electrical signals on copper PC board traces, driven by on-chip transistors instead of expensive external components. With short links, we could reduce electrical skew and use parallel links, giving higher bandwidth. And with a small, single-cabinet system we were able to use a single master clock, resulting in reduced synchronization delays. Our low-power design also has cascading benefits in reducing infrastructure costs such as building and air conditioning, and in reducing operational costs for electricity.
Well said, and especially for lower end applications power requirements can impact costs severely. At the higher end not many have the resources of Google who can afford semi-custom PC designs and to build gigantic facilities where land and power are cheap.
The SiCortex node (Figure 3) is a six-way symmetric multiprocessor (SMP) with coherent caches, two interleaved memory interfaces, high speed I/O, and a programmable interface to the interconnect fabric.
The usual MIPS-style instruction set, and a fairly decent amount of cache considering the CPUs are packed six to a chip. Sounds not too unlike a Sony/IBM Cell design, MIPS rather than PowerPC based, and double precision floating point is obviously targeted at scientific rather than gaming markets. Remains to be seen whether double precision performance is much slower than single precision floating point as in similar designs. But the kicker is that:
The processors are based on a low power 64-bit MIPS® implementation. Each processor has its own 32 KB Level 1 instruction cache, a 32 KB Level 1 data cache, and a 256 KB segment of the Level 2 cache. The processor contains a 64-bit, floating-point pipeline and has a peak floating-point rate of 1 GFLOPs. The processor's six-stage pipeline provides in-order execution of up to two instructions per cycle.
This simple design dissipates less than one watt per processor core.
which suggests a power draw of under 6W per chip, which is fairly impressive. The 500MHz clock frequency however is far lower than the 3GHz clock of the Cell in the PS3, which has a similar number of CPUs, but then the Cell is the only such chip on its board, not one of 27 as in the SiCortex nodes. But then I also can only agree with the point that:
The processor's rather modest instruction-level parallelism is well suited to HPTC applications which typically spend most of their time waiting for memory accesses to complete.
interior nodes in the hierarchy cannot have data associated with them: in the DNS both a domain name and its subdomain names can have addresses associated with them.
srv1) being preferred for
	floor 1 and the other (srv2) for floor 2. Let's
	consider then these possible naming schemes, which I will
	present all at once so the differences can be seen at a glance,
	with case-by-case comments afterwards:
; There should be '$ORIGIN Example.com' or equivalent here.
; #1
srv1.Example.com.            A     192.168.1.1
srv1.Example.com.            A     192.168.2.1
srv2.Example.com.            A     192.168.1.2
srv2.Example.com.            A     192.168.2.2
; #2
eth0.srv1.Example.com.       A     192.168.1.1
eth1.srv1.Example.com.       A     192.168.2.1
eth0.srv2.Example.com.       A     192.168.1.2
eth1.srv2.Example.com.       A     192.168.2.2
; #2
NFS-1.Example.com.           A     192.168.1.1
IPP-1.Example.com.           A     192.168.1.1
NFS-2.Example.com.           A     192.168.2.2
IPP-2.Example.com.           A     192.168.2.2
; #3
floor1.NFS.Example.com.      A     192.168.1.1
floor1.IPP.Example.com.      A     192.168.1.1
floor1.NFS2.Example.com.     A     192.168.1.2
floor1.IPP2.Example.com.     A     192.168.1.2
floor2.NFS.Example.com.      A     192.168.2.2
floor2.IPP.Example.com.      A     192.168.2.2
floor2.NFS2.Example.com.     A     192.168.2.1
floor2.IPP2.Example.com.     A     192.168.2.1
; #4
; #4.1
floor1.srv1.Example.com.     A     192.168.1.1
floor2.srv1.Example.com.     A     192.168.2.1
floor1.srv2.Example.com.     A     192.168.1.2
floor2.srv2.Example.com.     A     192.168.2.2
; #4.2
srv1.Example.com.            CNAME floor1.srv1.Example.com.
srv1.Example.com.            CNAME floor2.srv1.Example.com.
srv2.Example.com.            CNAME floor1.srv2.Example.com.
srv2.Example.com.            CNAME floor2.srv2.Example.com.
; #4.3
NFS.main.floor1.Example.com. CNAME floor1.srv1.Example.com.
IPP.main.floor1.Example.com. CNAME floor1.srv1.Example.com.
NFS.bkup.floor1.Example.com. CNAME floor1.srv2.Example.com.
IPP.bkup.floor1.Example.com. CNAME floor1.srv2.Example.com.
NFS.main.floor2.Example.com. CNAME floor2.srv2.Example.com.
IPP.main.floor2.Example.com. CNAME floor2.srv2.Example.com.
NFS.bkup.floor2.Example.com. CNAME floor2.srv1.Example.com.
IPP.bkup.floor2.Example.com. CNAME floor2.srv1.Example.com.
As to these:
flat for my tastes, and does not take enough advantage of the hierarchical nature of DNS, and does not make it easy to switch from the main to the backup server for each floor.
search lines in the resolv.conf (or equivalent DHCP) file, and/or by changing the pointed-to interfaces.
NFS.floor1.Example.com. CNAME floor1.srv1.Example.com.
IPP.floor1.Example.com. CNAME floor1.srv2.Example.com.
NFS.floor2.Example.com. CNAME floor2.srv2.Example.com.
IPP.floor2.Example.com. CNAME floor2.srv1.Example.com.
if one had clustering or load balancing for NFS and IPP across the two servers.
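Since the service aliases of scheme #4.3 in the zone listing follow a regular pattern, they could be generated from a small table rather than typed by hand. A sketch, where the function name gen_aliases and the table layout are hypothetical and the main and backup assignments are the ones given in that listing:

```shell
#!/bin/sh
# Sketch: print the service aliases of scheme #4.3 from a small table.
# gen_aliases and the table layout are hypothetical.
gen_aliases() {
  domain=$1
  # Columns: floor, role, server that plays that role on that floor.
  while read -r floor role srv
  do
    for svc in NFS IPP
    do
      printf '%s.%s.%s.%s. CNAME %s.%s.%s.\n' \
        "$svc" "$role" "$floor" "$domain" "$floor" "$srv" "$domain"
    done
  done <<EOF
floor1 main srv1
floor1 bkup srv2
floor2 main srv2
floor2 bkup srv1
EOF
}

gen_aliases Example.com
```

Swapping the main and backup servers for a floor then means editing two lines of the table and regenerating.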
