This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
Flash SSD storage devices are based on memory chips with extremely
anisotropic performance profiles: in particular they cannot
overwrite data in place, and can only erase it in
large blocks, so writing to them in small blocks is simulated
by the firmware, and the simulation depends on recent history.
Some recent group tests of flash SSD devices from the usual
excellent reviewers show this clearly: for example one page
shows a graph of how much lower random write speed on a used device
is than on a virgin one, immediately after some previous
use, after a 30 minute interval, and after using the
TRIM command to help the drive
firmware manage its
simulation of a writable device to best effect.
The differences can be rather significant, with many drives writing at less than 20% of their virgin-state speed after some use, and even after a 30 minute pause; in the same test, however, all but one revert to virgin-state speed after use of the TRIM command.
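For example, on Linux the TRIM command can be issued for all currently free blocks of a mounted filesystem with fstrim, or continuously with the discard mount option on filesystems that support it (the mount point here is purely illustrative, and continuous discard has a higher overhead on some drives):

# fstrim -v /srv/data
# mount -o remount,discard /srv/data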
From the same group test another page shows that for some of the drives tested speeds are significantly reduced when the drive is 50% full.
It is notable however that the flash drives still have fairly good transfer rates compared to a disk drive; but I suspect that latency is more impacted than throughput.
Interesting article with a graph that shows a definite ramping up of IPv6 traffic as seen by Google, hitting 2% after a long time at negligible levels.
It is also interesting that Teredo traffic has essentially disappeared, after being most of the IPv6 traffic they were seeing.
The adoption of IPv6 is clearly going to be dominated though by mobile devices (tablets, smartphones), as demonstrated by previous news that even a USA-based carrier is using IPv6 for their most recent network.
Noticed an interesting thread on using ZFS as the storage layer for the OpenAFS meta-filesystem, which started with this series of questions:
- Are you using ZFS-on-Linux in production for file servers?
- If not, and you looked into it, what stopped you?
- If you are, how is it working out for you?
- ext3/ext4 people: What is your fsck strategy?
The most interesting for me is the last one, about fsck
strategy, given my long-standing interest
in the lack of scalability in the running time of most current
filetree repair tools, and the implication that ZFS does not
need a filetree repair tool like fsck.
This is a persistent myth, and it is based on the half-true
notion that ZFS and BTRFS and other copy-on-write, versioning
filesystem designs don't need a filetree checking tool because
thanks to pervasive checksums they can detect
malformed parts of a filetree natively, without waiting for a
whole-filetree scan.
It is true that they don't need a filetree checking tool to detect
malformed parts of the filetree when they are
accessed, but there is still a need to detect
malformed parts of the filetree periodically, even if
they are not otherwise accessed by applications. Indeed ZFS
has such a tool, and running it is called scrubbing.
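A minimal example, with an illustrative pool name:

# zpool scrub tank
# zpool status tank

The second command reports scrub progress and any checksum errors found, and, where the pool has redundancy, repaired.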
Moreover hopefully the terminology switch above has been
noticed: I started talking about
filetree repair tools
and then switched to
filetree checking tools. Those are
two very different concepts, and while pervasive checksums
half-obviate and facilitate filetree checking they
don't obviate or facilitate filetree repair, which
in most cases that matter requires a whole-filetree scan and
some heuristics to rebuild a well-formed filetree structure.
There is the argument that a versioning filesystem does not
need filetree repair tools because if the current version gets
malformed, for example by a system crash, it is easy to
roll back to a well-formed earlier version. But that argument
only covers the easy case of simple damage, of the sort
that journaling filesystems were designed
for, to reduce the need to run filetree repair tools, but not
to eliminate it.
The damage that a filetree repair tool is meant for is typically that arising from limited storage layer issues, such as damaged recording media or coding mistakes which can cause random, not immediately reported, corruption of data.
The whole-tree scan and repair is worthwhile if it is a cheaper operation than a whole-tree recovery from backups, which is the case as a rule. The absence of such a tool means that backups and recovery from backups have to happen more frequently than if such a tool was available, which is not an insignificant issue.
ZFS and BTRFS and similar filesystem designs try to reduce
the need for more frequent whole-tree backups and recovery
from backups by using piecemeal storage redundancy, that is by
keeping two or more copies of at least some parts of a
filetree, so that if a copy is detected as malformed repair
can simply be restoring it from a non-malformed copy. But this
makes such backups and restores incremental and online, reducing
their latency, which is good, but does not reduce their
overall cost much; ZFS at least has a scrub operation
that does whole-tree checking and repair, where possible using the
redundancy of the storage layer.
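As an illustration of such piecemeal redundancy (pool, dataset and device names are purely illustrative), both ZFS and BTRFS can keep extra copies that a scrub can use for repair:

# zpool create tank mirror /dev/sda /dev/sdb
# zfs set copies=2 tank/home
# mkfs.btrfs -d raid1 -m raid1 /dev/sdc /dev/sdd

The first two keep duplicate copies under ZFS (across a mirror, or of each data block even within a single device), the last duplicates both data and metadata across two devices under BTRFS.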
Besides the cost, the schemes used by ZFS and BTRFS for storage redundancy can be used separately with other filesystem designs, and they don't obviate the advantage of having a filetree repair tool either. ZFS and BTRFS integrate to some extent the filesystem layer with the storage redundancy layer, which makes it more convenient, and perhaps reduces the frequency with which filetree repair tools need to be run, just like journaling does, but still does not eliminate it.
Never mind eliminating the need for conventional backups.
While reading a recent blog post about latency adding hours to a large database job:
That’s when I figured I’d ping the PostgreSQL server from the ETL server and there it was, 1ms latency. That was something I would expect between availability zones on Amazon, but not in a enterprise data center.
It turned out that they had a firewall protecting their database servers and the latency was adding 4-5 hours to their load times. The small amount of 1ms really adds up when you need to do millions of network round trips.
I was amused by the evasiveness of the analysis of the situation:
The details above strongly imply that the job runs in half-duplex mode, with each record update being synchronous, that is, each record update is a separate transaction.
The 1ms latency has an impact on this job only because someone coded the job to be synchronous with each record update while processing 7 million records, which seems ridiculous to me, unless there is a strongly compelling reason for updating each record synchronously, which is very rare for bulk ETL workloads.
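As a purely hypothetical illustration (host, table and file names invented): 7 million statements at one round trip of roughly 1ms each is already about two hours of pure network latency, and two or three round trips per record gets to the 4-5 hours mentioned; the first command below pays the round trip once per record, while the second streams the whole load in a single COPY:

$ time psql -h db.example.com etl -f per_row_updates.sql
$ time psql -h db.example.com etl -c "\copy target FROM 'records.csv' CSV"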
One of the big stories about networking, apart from the
difference between switched and routed internets,
is the large difference between routing within an intranet
(organizational internet) and routing across the global
Internet.
The main difference is that routing in the global Internet needs a routing table entry for each independently routable IPv4 subnet, leading to IPv4 routing tables with nearly 500,000 entries.
This is due to the global IPv4 Internet routing aim of being based on an unrestricted mesh, where every subnet may be connected anywhere, perhaps multiple times, and thus every router has to store a complete list of routes to every subnet, even though almost all of those subnets are connected to a much smaller number of communication company networks.
IPv6 was meant to have the opposite aim, of being based on nearly hierarchical routing, where local IPv6 prefixes are subsets of some communication company prefix, and thus can be subsumed in the latter prefix; for example all the /48 prefixes delegated out of an ISP's /32 can be covered by a single route for that /32.
There is some debate (for example 1, 2) as to whether this will happen, and whether Internet IPv6 routing tables will be significantly smaller than IPv4 ones.
The debate is about whether IPv6 addresses will be hierarchical as intended, with prefixes for an AS all subsumed by one or more ISP prefixes, or whether AS owners will buy globally routable IPv6 prefixes, which can be quite cheap.
AS owners may want globally routable prefixes in two cases: when they have multiple uplinks to different ISPs, and to avoid changing IPv6 addresses when switching ISPs.
The first case of multiple ISPs is quite weak, because it is possible to advertise a prefix from one ISP through another. The second case is where things get more difficult, because purely hierarchical addressing does require renumbering when leaving one branch of the hierarchy and joining another branch.
But in theory such renumbering should not be an issue, as anyhow IP addresses are just numbers, like Ethernet addresses, and should not be imbued with any special significance, as they are usually invisible to both applications and users, since both as a rule identify services by DNS entries.
However in practice IP addresses, and even Ethernet addresses, do get imbued with special meanings, for example as authentication tokens, or are not managed as flexibly as the DNS for example.
Therefore the demand for globally rather than hierarchically routable prefixes. Or for IPv6 NPT to be applied to ULA prefixes, which may be rather worse.
Fortunately however, for now, as per one of the links above, there are around 10 IPv4 prefixes for each of the 40,000 active AS numbers, while for IPv6 the figure seems to be 0.2 prefixes, as most ASes don't advertise IPv6 prefixes.
The prediction that every AS will eventually advertise to the global Internet routing table at least one IPv6 prefix is not correct if hierarchical addressing is used, and indeed that is the main reason to use a hierarchical prefix scheme: while each AS may well advertise one or more IPv6 prefixes, almost all will only advertise them to their uplink providers, which will not readvertise them individually.
Right now it is not clear to anybody how this will play out. Those who think that there will be an explosion of routes like for IPv4 point at how cheap a globally routable IPv6 prefix is, and how convenient it is for someone buying one to squander Internet router resources for which they are not billed.
I was amused by the somewhat belated realization by a blogger that Chromium consumes 100% CPU on Google sites, because that is exactly what has been happening for a long time. Most browsers and sites are designed for the convenience of site publishers, not that of browser users, and that convenience means capturing and driving the attention of those users to sell something to them.
Because after all the processing power needed to run all that is free to the site owner, and therefore they have a strong incentive to use as much of it as they can to make their pitch; plus it is also free to the people hired by the site owner to maintain the site, while writing efficient code costs them money.
It just occurred to me that the extra 4 bytes for the VLAN tag in a tagged Ethernet frame are not necessary for frames addressed to ordinary addresses, because these are specified to be globally unique, and it does not matter in which specific VLAN they are, at least as to the primary function of VLAN tagging, which is to create limited broadcast domains.
Only broadcast frames need to carry a tag that matches those of the ports they are intended for, as they only need to be forwarded to the ports with the same tag, and since Ethernet broadcast frames are such by virtue of a single bit in the address, there is plenty of broadcast address space to put a tag in without extending the frame format.
But there is already a case of frame addresses with the broadcast bit on and the rest not all ones: multicast frames. VLANs could then have been defined, or could now be reinterpreted, as multicast groups.
The only real difference between a multicast group and a VLAN using tagged frames, apart from the syntactic difference of an extra few bytes of address, is that Ethernet addresses in different VLANs are not mutually visible. But this matters only if one uses Ethernet addresses or VLAN tags as authentication or security tokens, which is popular because it is expedient, but authentication or security should better be handled at a much higher level.
In the previous entry I referred to 6to4 as a fairly cheap way to work around the lack of IPv6 hardware routing in some less recent routing products.
Some time ago discussing the merits of switched and routed
internets I was astonished to hear that
someone reckoned that enterprise switches also capable of
routing could not route IPv4 in
hardware and that was why switched internets were common.
Hardware routing is a major issue because switches and routers are usually designed as embedded systems, with a low wattage, rather slow CPU with very limited memory, controlling one or more hardware modules doing the switching among modules over a dedicated backbone bus. The hardware modules have forwarding tables in dedicated memory, and dedicated chips that extract addresses from frames or packets and look them up in the hardware tables.
That used to be true many years ago: IPv4 packets have a
header field (the TTL) that must be decremented by 1 every time they
pass through a router, and a header checksum
that must be recomputed
whenever a header field is modified, and doing that in
realtime using hardware logic used to be a difficult problem,
but that problem was solved over ten years ago.
The hard part is recomputing the checksum, and therefore IPv6 was designed without a header checksum, to facilitate hardware routing.
When routers could not route at line speed, because routing was done in software, but switches could switch at line speed, there was also a performance reason for having switched internets instead of routed ones.
So flooding and then updating of switch forwarding tables was seen as the cheap alternative to routing, whether static or with RIP or OSPF (optionally with ECMP enabled); plus it enabled internet-ing of popular protocols that did not support routing, and it allowed spreading IPv4 subnets over multiple switches, enabling the use of IP addresses for purposes unrelated to networking, such as access control.
Fortunately for several years enterprise-level so-called
level-3 switches (routers with an embedded
switch) have been able to route in hardware at the same speed
at which they switch.
The only major difference is that router routing tables tend to be smaller than switch forwarding tables: mid-range products tend to have routing tables capable of handling hundreds to thousands of routes, while their forwarding tables can handle tens of thousands of Ethernet address forwardings.
This difference means that in some site internets it is not possible to use single-host routes for all nodes, which sounds extreme but is the equivalent of switches holding a forwarding table entry for every Ethernet address in the internet.
Because inter-switch frame forwarding is layer 2 routing, as some form of routing must be involved in a network made of multiple networks, and each switch defines one (or more than one) network.
As previously noted at length, I think that level-2 inter-switch frame forwarding was not a good idea even many years ago, but it might have been expedient for performance reasons.
Currently the performance reasons no longer apply, and if the rather unnecessary evil of location-independent IP addresses is desired, it is possible to use single-host routes instead of relying entirely on per-Ethernet-address forwarding tables.
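For example (with purely illustrative addresses), a node that keeps its IP address while attached behind a different router can be reached by adding a single-host route on the relevant routers:

# ip route add 10.7.3.44/32 via 10.9.1.1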
Similarly, autoconfiguration and resilience can be handled with a suitable OSPF and ECMP configuration, so on this count too Ethernet frame forwarding can be avoided; flooding-and-learning of individual Ethernet addresses is in any case a fairly poor routing algorithm, especially compared to OSPF with ECMP.
I previously wrote that setting up 6to4 tunnel interfaces on Linux is best done by setting up two distinct ones: one for the 6to4 prefix itself, as if it were a local network, for direct packet exchange between 6to4 nodes, and another one for the rest of the IPv6 address space, giving as tunnel endpoint that of a 6to4 gateway. I have also shown how to do this with Debian-style /etc/network/interfaces configuration files.
Apparently something that is not obvious is that at the very least the 6to4-to-6to4 tunnel interface, with its associated 2002::/16 route, should be set up by default if one wants IPv6 connectivity and the node has a publicly routable IPv4 address, even if the node already has a native IPv6 address with an associated default (::/0 or 2000::/3) route.
Because it means that traffic with other 6to4 nodes does not need to go through an IPv6 router or 6to4 forward or reverse gateway at any point, and the cost of 6to4 encapsulation is quite small.
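A minimal sketch of that two-interface setup using ip(8) commands rather than an interfaces file, assuming the node's public IPv4 address is 192.0.2.4 (so its 6to4 prefix is 2002:c000:204::/48) and using the anycast relay address 192.88.99.1; interface names and addresses are illustrative:

# ip tunnel add tun6to4 mode sit remote any local 192.0.2.4 ttl 64
# ip link set tun6to4 up
# ip addr add 2002:c000:204::1/16 dev tun6to4
# ip tunnel add tun6to4gw mode sit remote 192.88.99.1 local 192.0.2.4 ttl 64
# ip link set tun6to4gw up
# ip -6 route add 2000::/3 dev tun6to4gw

The /16 prefix length on the first interface provides the on-link 2002::/16 route for direct 6to4-to-6to4 traffic, while the second interface carries everything else to the relay.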
In effect 6to4 is so useful that it should always be set up if
the node has a publicly routable IPv4 address and IPv6
connectivity is desired, and especially if it is a client,
as clients usually generate relatively little traffic anyway. It can mean
avoiding setting up a native IPv6 address (if a 6to4 relay
gateway or interface is also set up), and each 6to4 address
comes with a free /48 subnet.
Which means that an IPv4 router for an IPv4 LAN can be turned into an IPv6 router for the associated /48 practically for free, and the nodes in that IPv6 prefix don't need to be 6to4 nodes themselves, just native IPv6, even if they should be because of the previous argument.
For a server with an expected significant traffic load, the choice between having both a native IPv6 and a 6to4 address or just a 6to4 address depends on whether the upstream router can function as a gateway between native IPv6 and 6to4 at high speed, because then there is no need to have a native IPv6 address: native IPv6 packets are routed to the upstream router, which then just encapsulates them as 6to4 and sends them to the server, and vice versa.
This can be a very attractive option in one not so uncommon case: when one has a very optimized, highly reliable IPv4 campus net, but one that is expensive to reconfigure for IPv6, or that cannot handle IPv6 packets at all or at the same speed, for example because most routers don't have hardware routing for IPv6, or IPv6 routing is (unconscionably) an expensive additional option.
In this case one only needs to purchase or upgrade 6to4
border routers, accepting IPv6 traffic incoming, or 6to4
traffic outgoing, and the rest of the network can remain
cheaply IPv4 without much compromise. Thanks also to
host (unicast) routes (/32
for IPv4 and /128 for IPv6),
the 6to4 gateway routes can be transparently multiple and load
balancing (or not, depending on route priorities).
In this way one trades the cost, in terms of engineering effort, hardware upgrades and license upgrades, of routing IPv6 natively across a perhaps vast local internetwork for the relatively cheap cost of 6to4 encapsulation at a few border routers or LAN routers, or at leaf nodes, which can be an attractive proposition.
How cheap is 6to4? For one data point we can compare native IPv4 vs. 6to4 transfer rates and CPU times on a 1Gb/s LAN:
# nuttcp -4 -t 192.168.1.34
  836.4609 MB / 10.06 sec = 697.4131 Mbps 5 %TX 36 %RX 0 retrans 0.31 msRTT
# nuttcp -6 -t fd00:c0a8:100::1
  741.4141 MB / 10.07 sec = 617.3787 Mbps 5 %TX 53 %RX 0 retrans 0.29 msRTT
# nuttcp -6 -t 2002:c0a8:122::
  718.3125 MB / 10.04 sec = 600.2773 Mbps 29 %TX 46 %RX 0 retrans 0.35 msRTT
# nuttcp -6 -r 2002:c0a8:122::
  602.0693 MB / 10.04 sec = 503.1609 Mbps 21 %TX 69 %RX 0 retrans 0.35 msRTT
# nuttcp -6 -r fd00:c0a8:100::1
  499.7539 MB / 10.05 sec = 417.1046 Mbps 12 %TX 57 %RX 0 retrans 0.34 msRTT
# nuttcp -4 -r 192.168.1.34
  691.3633 MB / 10.09 sec = 574.8813 Mbps 11 %TX 83 %RX 0 retrans 0.34 msRTT
As the unexciting native IPv4 rates indicate, this is not a very high speed LAN setup, but the 6to4 numbers are within 10-15% of the native IPv4 numbers, even if the CPU usage is rather higher, and a significant part of that can be attributed to the higher cost of processing IPv6 packets, as the native IPv6 transfer rates show (those use a non-canonical form of fd00::/7 unique local addresses).
Note: it is unexpected that the receiving rates for native IPv6 traffic are significantly lower than those for 6to4 packets, but I suspect that is due to the IPv4 code being more tuned than the IPv6 code for latency-critical work like receiving.
Note: the two nodes involved are an
i3 laptop and a
Phenom X3 desktop in low power mode,
with maximum CPU speeds of 2.4GHz and 2.8GHz; both have rather
consumer-grade and somewhat old network interface chips, as
is the switch chipset.
This scheme can be extended in another way: it is probably possible to publish routes to native IPv6 addresses via 6to4 router addresses, giving the possibility of nodes with native IPv6 addresses yet accessible only via an established, well tuned IPv4-only network infrastructure.
Sure, it would be better to avoid encapsulation entirely, and be able to afford upgrading a whole router mesh to support native IPv6 at full speed, but that may require a very large investment of engineering effort and hardware and license spending.
In particular, because it is somewhat unlikely that IPv6-only servers will be set up anytime soon, having servers that are 6to4-only, or that have both native IPv6 addresses and 6to4-prefix addresses, can be a good idea.
I have just read a (rarely) thorough review of a recent and well regarded flash SSD device.
Among the more interesting aspects is the extensive discussion of the particularities of the flash SSD performance envelope, for example:
One area of performance that isn't mentioned often is the condition referred to as "steady state." This is the performance that will be experienced after an extended period of time with the SSD. We will be using Iometer testing to show the difference between a brand new shiny SSD, and that same SSD after it is loaded with data and subjected to continuous use over a period of time.
We will also use SNIA guidelines to place the SSD into steady state, and then test with our trace applications. Steady State trace-based testing will illuminate a bit of the difference between actual application performance after the SSD has been used for a period of time.
Another is the illuminating comparison of various drives as to minimum, average and maximum values of well-chosen figures of merit, for example maximum read latency:
256GB OCZ Vector         0.812
256GB Samsung 840 Pro    0.819
480GB Crucial M500       1.554
250GB Samsung 840 TLC    1.710
256GB Toshiba THNSNH     0.716
and maximum write latency:
256GB OCZ Vector         0.784
256GB Samsung 840 Pro    0.720
480GB Crucial M500       3.414
250GB Samsung 840 TLC    0.735
256GB Toshiba THNSNH     13.540
It would be interesting to see maximum latencies in the case of interleaved reads and writes, and I have seen tests that include those, but even just reads, and just writes, are highly informative.
The test also shows the difference between performance when new and used:
(comparison chart of new versus used performance for the 256GB Toshiba THNSNH, 250GB Samsung 840 TLC, 480GB Crucial M500, 256GB Samsung 840 Pro and 256GB OCZ Vector)
Someone I know who is still a developer in a first-world country prefers not to change options from their defaults when configuring applications, because he thinks that applications only get tested with defaults.
His implicit expectation is that applications get written haphazardly, so that to start with they don't work at all, in the sense that they are fundamentally broken, and then after many cycles of testing whatever gets tested is made to work, but only as far as it is tested, because most managers responsible for a software product don't see any point in wasting budget trying to get something to work that is not tested.
Admittedly I have seen quite a few cases where software gets written broken and then parts of it are made to work, instead of being written right with a few mistakes here and there.
It has worried me a bit to see that someone is worried that ext4 has many options:
I don't recommend nodelalloc just because I don't know that it's thoroughly tested. Anything that's not the default needs explicit and careful test coverage to be sure that regressions etc. aren't popping up.
(One of ext4's weaknesses, IMHO, is its infinite matrix of options, with wildly different behaviors. It's more a filesystem multiplexer than a filesystem itself. ;) Add enough knobs and there's no way you can get coverage of all combinations.)
I reckon that in large part the ext code has been written carefully rather than haphazardly slapped together, but it is older code that has been extended many times, and the number of options is a reflection of that and can have some different bad consequences: