This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.
Flash SSDs come also in a format similar to memory sticks, with a PCIe compatible interface, and they tend to be quite fast too.
In a related discussion I was pointed at this fairly dramatic video on how hot the CPU chip on the stick becomes, and over 100°C seems quite hot to me; other reviews confirm that even for relatively modest operations involving a few GiB of IO the CPUs on other sticks can become similarly hot (1, 2).
The heat of course has an influence on the durability of the adjacent flash chips, as flash technology is fairly sensitive to heat; more so than the CPU chips themselves, which probably are rated for those temperatures.
I am quite astonished that these memory-stick like drives don't come with even a minimal heat spreader and dissipator, as DRAM memory sticks often (and usually pointlessly) do.
I have also briefly worried about temperatures inside 2.5in form factor boxed flash SSD drives, but I think that there are two important differences: since the base board is rather larger the CPU chip is usually far away from the flash chips, and usually it is connected to the metallic case via a thermally conductive pad. Since after all the total power draw of a flash SSD is of the order of a few watts, the metallic case is most likely amply sufficient. The problem with memory-stick like devices is that the power draw, and thus the heat, is concentrated in the small CPU chip and builds up.
But there are exceptions to this, and I noticed recently that a flash SSD 2.5in PCIe drive that is rated for a peak power draw of 25W has heavy cooling fins.
A very long running test of flash SSDs by the often excellent TechReport involved rewriting to them constantly to verify their endurance, and it has recently finished after starting a little more than 18 months ago.
The results are quite impressive: the shortest endurance was for 700TB of writes, and the longest was for 2400TB written over 18 months, at a rate of over 120TB per month, with breaks, as the products were reset with a SATA secure erase every 100TB written.
The speed of the units showed little sign of degradation until the end. But it is interesting to note that the continuous write speed is not quite the same as that for somewhat limited duration tests, as shown in the graph at the bottom of the page: it can go from 75MB/s to 250MB/s, and one (particularly good) device oscillated between 75MB/s and 120MB/s.
The authors also report on the numbers reported by the devices as to erase block failures, and for most devices they follow a fairly reasonable, steady rise, but only after hundreds of TBs have been written, and the reserve of fresh erase blocks tends to last a long time. This implies that leaving a small part of a flash SSD unused can indeed give benefits; and probably not just higher endurance, which in most cases is pointless, but also lower latency and jitter for writes.
As the authors point out, for typical single user workloads endurance is not an issue, as the SSDs in my main desktop have logged less than two terabytes of writes over the past couple years, which is near the 4GiB/day (1.5TB/year) written on my laptop.
The endurance test by TechReport shows that ordinary consumer grade flash SSD drives have in practice a pretty huge endurance that amounts to many decades of desktop or laptop use. These endurances are not warranted, and as previously reported flash SSDs graded as enterprise have warranted endurances of 3500TB or even 7000TB of writes (1, 2), both of which compare well or favourably to the specified endurances of magnetic disk drives, because as I pointed out previously magnetic disk drives also have duty cycles, for example 180TB per year for 3 years (archival drive) or 550TB per year for 5 years (enterprise drive) of total writes and reads. Every product has a duty cycle.
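As a rough sanity check (my own arithmetic, not from the TechReport article), dividing the measured endurances by the roughly 1.5TB/year laptop write rate mentioned above gives the implied lifetimes in years:
$ awk 'BEGIN { print 700/1.5, 2400/1.5 }'
466.667 1600
That is several centuries of laptop-like use.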
Flash SSDs have improved a lot in a few years as to the wear leveling done by their FTL, and also as to the latency and jitter caused by wear leveling, even if erase block sizes have increased a lot (I think currently most flash chips have 8MiB erase blocks), and I think that my previous conclusion still holds: at those levels of endurance these enterprise flash SSD drives might well be suitable for replacing 2.5in SAS 10000RPM or 15000RPM disk drives.
What ordinary flash SSDs probably will never be suitable for is long term offline storage, per the little known detail that I mentioned previously that they lose data after as little as 3 months without power, while enterprise flash SSDs have huge capacitors that presumably can extend that significantly.
For several years I have been using some shell functions which are part of my standard env shell startup script. These functions allow setting from the command line the label (title) of an xterm or konsole window or a screen/tmux or konsole tab.
I have recently updated them cosmetically because they
contained embedded raw
escape sequences
that were annoying to print; the updated versions have now all
those escape sequences encoded, and the printf
command or shell builtin emits the raw sequences.
The first function sets the window title of the current xterm window. It also works with other graphical terminal programs that use the same escape sequence.
# Set window title: \033]0;MESSAGE\007
xtitle() \
{
  case "$#" in
  '0')  set "$HOST"\!"$USER";;
  esac

  case "$*" in
  \!|\@) echo 1>&2 'No arguments or default!'; return 1;;
  esac

  printf "\033]0;$*\007"
}
This function sets the title of the current tab/window in the screen command. It is often quicker than using the relevant prefix command.
# Set SCREEN tab title: \033kMESSAGE\033\134
stitle() \
{
  case "$#" in
  '0')  set "$PS1";;
  esac

  case "$*" in
  \!|\@) echo 1>&2 'No arguments or default!'; return 1;;
  esac

  printf "\033k$*\033\\"
}
This function sets the title of the current konsole tab.
# Set tab title: \033]30;MESSAGE\007
ktitle() \
{
  case "$#" in
  '0')  set "$PS1";;
  esac

  case "$*" in
  \!|\@) echo 1>&2 'No arguments or default!'; return 1;;
  esac

  printf "\033]30;$*\007"
}
The last one sets the color of the current konsole tab, but it seems no longer effective in recent konsole versions.
# http://meta.ath0.com/2006/05/24/unix-shell-games-with-kde
# Set tab title color: \033[28;RGBt
# where RGB is in *decimal*.
kcolor() \
{
  case "$#" in
  '0')  set 255 32 0;;
  esac

  case "$*" in
  \!|\@) echo 1>&2 'No arguments or default!'; return 1;;
  esac

  RGB=`expr "$1" \* 65536 + "$2" \* 256 + "$3"`
  printf "\033[28;${RGB}t"
}
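For example, once these functions have been sourced from the startup script, a typical (made up) session would be:
$ xtitle "backups on $HOST"   # relabel the current xterm/konsole window
$ stitle mail                 # relabel the current screen/tmux tab
$ ktitle "remote: www1"       # relabel the current konsole tab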
Having recently mentioned the issues arising from increasing disk capacity without increased access rates, and slicing disks to reduce the impact of that, I have reread one of the first mentions of this issue in a very interesting presentation by the CERN IT department in 2007 on Status and Plans of Procurements at CERN, where one of the pages says:
Large Disk Servers (1)
- Motivation
- Disks getting ever larger, 5 TB usable will be small soon
- CPUs ever more powerful
- Potential economies per TB
- Constraints
- Need a bandwidth of ~ 18 MByte/s per TB
- Hence everything more than 6 TB means multiple Gbit connections or a 10 Gbit link
- Networking group strongly advised against link aggregation
That presentation mentions the bandwidth of ~ 18 MByte/s per TB; when the presentation was made I asked the author what that meant, and he told me it was a concurrent transfer rate of a read thread and a write thread of around 18MB/s per TB, designed to prevent vendors from bidding storage systems based entirely on large disks that did not have enough IOPS to sustain a reasonable workload.
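A quick check of the arithmetic behind the 6TB figure (my own, using decimal units loosely): at 18MB/s per TB a 6TB server needs around 108MB/s, or about 0.86Gbit/s, which just fits in a single Gbit link, while anything larger needs more, for example around 1.4Gbit/s at 10TB:
$ awk 'BEGIN { print 6*18, 6*18*8/1000; print 10*18, 10*18*8/1000 }'
108 0.864
180 1.44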
I asked the same author a few days ago whether CERN procurement continued to have the same requirement, and I understood from his reply that CERN no longer has it in practice, because for systems that merely archive data it does not matter, and they are requiring flash SSDs (which certainly satisfy the requirement) for systems that do data processing, in part because, as anticipated, ever larger disks no longer have enough IOPS-per-TB.
Overall the problem is to have balanced systems, that is ones where the ratio between CPU, memory, storage capacity and IOPS matches the requirements of the workload. One of the oldest rules was the VAX rule of one MiB of memory per MIPS of CPU speed (and at the time typically one instruction was completed per Hz), and somewhat amusingly many current systems still have that ratio, for example server systems with 16 CPUs at 2GHz having 32GiB of memory are fairly typical.
The original VAX-11/780 had that ratio, and also typically around 500MiB of disks (typically 2×300MB disks) capable of around 100 random IOPS together, with a memory/storage ratio of around 500 and an IOPS-per-TB ratio of around 200,000.
A typical 16 CPU server with 32GiB of memory and two 500GiB SSDs capable together of 80,000 IOPS has a memory/storage ratio of around 30 and an IOPS-per-TB ratio of around 80,000 which is a factor of 2-3 lower than a few decades ago.
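For those who like to see the arithmetic (my own rough figures, mixing MB and MiB loosely), the storage-to-memory and IOPS-per-TB ratios for the two examples above are:
$ awk 'BEGIN { print 500/1, 100/0.0005 }'   # VAX-11/780: ~500MB of disk, 1MiB of RAM, ~100 IOPS
500 200000
$ awk 'BEGIN { print 1000/32, 80000/1 }'    # 16 CPU server: 1TB of SSD, 32GiB of RAM, 80,000 IOPS
31.25 80000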
That seems acceptable to me considering that for many applications CPU processing time has gone up. For CERN the greater ratio of CPU and memory to everything else has enabled them to greatly improve the quality of their analysis and simulation.
I recently argued that large disks should be sliced and RAID sets built across such slices, so I was amused to read (I can't remember where) an interesting guess as to how Amazon's Glacier service is so much cheaper than online storage yet has such long delays for first access; this relates to my other post about The low IOPS-per-TB of large disks.
The idea is that Amazon may be
short-stroking
large disks
by using only their outer cylinders (like
some vendors also apparently do)
to improve the IOPS-per-TB of their online storage and
therefore has the rest of the disk available for low priority
accesses. Presumably, if this is true, the long delay
for first access to Glacier is due to the data being copied
off the inner cylinders to staging
disks
very slowly (and intermittently) to avoid interfering with the
accesses to the outer cylinders.
I was recently discussing IPC methods, in the usual terms of
message passing
and shared
memory.
The main advantages and disadvantages are the well known ones: message passing involves copying the data between the address spaces of the source and target processes, while shared memory avoids the copying but requires the processes to carefully synchronize their accesses to the shared data.
The two common techniques above can be used with or without dividing memory into independent address spaces using virtual address translation, but then I mentioned that if the latter is available there is a much better and long forgotten scheme which was designed as part of the MUSS operating system several decades ago, which relies on the availability of virtual memory:
each process contains a thread and a collection of data segments, and a data segment can be detached from one process and attached to another.
This combines the best of message passing and shared memory, because it involves neither data copying nor data sharing: the crucial property of this scheme is that a data segment is only ever attached to just one process at a time, so it is never shared, but since it gets detached and attached by updating the virtual page tables, there is no copying of data either.
Note: the above is a simplified version
of the MUSS scheme, which had another important property with
additional conspicuous advantages: segments are not implicitly
mapped
into the address space of the
process they are attached to. This means that detaching and
attaching a process is in many cases a very cheap operation,
just involving updating a kernel table. Thanks to virtual
memory segments could be either memory resident or disk
resident, with disk resident ones being
page-faulted
into memory (if mapped) and modified pages saved back to disk
using ordinary virtual memory mechanisms.
Note: the MUSS scheme was generalized elegantly to network communications: if the source and target processes are on different systems the data in the segment can be moved (by copying it to the target and deleting it from the source) over the network entirely transparently. This is thanks to the essential property that segments are only ever attached to one process.
In MUSS this method was used for everything: characters and lines read from a terminal by the terminal driver were sent as segments to the login manager and then to the newly spawned command processor; files were segments too, opened by sending a segment with their name to a file system process, which would then reply with a (disk mapped) segment being the file.
But even the much better MUSS scheme can be substantially improved, as it occurred to me long ago, because most IPC is in effect a remote procedure call, where the data queued by the source process is in effect arguments, and the data queued back by the target process is in effect return values, and the operation, more than queueing, is actually pushing arguments and then result values on a conceptual stack.
Therefore my scheme moves both the stack segment carrying the arguments and the thread from the source process to the target process. The segment will be most conveniently the stack segment of the thread itself. Thus a system will have a number of address spaces some of which will have no threads, and some of which will have several threads.
Note: I discovered this scheme by looking at and generalizing the MUSS scheme. A somewhat similar scheme in the Elmwood operating system was inspired by the RIG operating system project, which is part of a different path of kernel design (1, 2) and that inspired the Accent operating system which in turn inspired the better known Mach kernel.
Each address space will have then a specific entry point where threads that are moved to it start executing, and a set of segments containing the code implementing the service and the state of that service. Creating an address space is the same as creating a process, where an initial thread initializes the address space, but with the difference that the thread will end with an indication to keep the address space after its end.
The service could be for example a DBMS instance where the initial thread opens the database files, initializes the state of the DBMS, and then end. A thread wanting to run a query on the database would then push the name of the operation on its stack, the query body, and move to the address space with the DBMS instance, by requesting the kernel to switch page tables and to set its instruction pointer to the address space entry address, and the code there would look at the first argument being the name of the operation, invoke it with the query from the second argument, and push the query results onto the stack and move the thread back to the original address space.
Note: when a thread moves out of an address space into another one there must be a way to ask the kernel to create for it a return entry point and push its identity onto the thread's stack. For this and other details there can be several different mechanisms, omitted here for brevity.
Note: among the details this scheme is quite easy to extend to full network transparency, thus allowing threads to move to address spaces in other systems, and it is also easy to allow for mutual distrust between source and target address space.
There is no explicit locking involved in moving a thread from one address space to another, because it is the same thread that moves, thus no need for synchronization. Once the thread is in the address space of the target DBMS instance there could be several different threads so there must be locking, but that would happen also for concurrent execution of message passing requests, and obviously also in the shared memory case.
An address space without threads is in effect what in
capability systems
is called a
type manager
in that threads can access the data contained in it
only by entering it at a predefined address of the code
contained in it.
My IPC scheme is thus in effect a software capability system framework with a coarse degree of protection, but one that is quite cheap, unlike most software capability designs that rely on extensive operating system work.
Note: it is a framework for a system as it does not define how capabilities are implemented, it only provides a low cost mechanism for implementing type managers as address spaces. The code within them can implement capabilities in various ways, for example as in-address space data structures, handing out to type users just handles to them, or as encrypted data structures returned directly to type users.
As I wrote previously about data access pattern advising, fadvise with POSIX_FADV_DONTNEED and POSIX_FADV_WILLNEED does not advise about future accesses being sequential, but is about data already read or written, and has an instantaneous effect; that is, it has to be repeated after each series of data has been read or written. That is quite unwise.
That for example means that tools like nocache have to do extra work, and bizarrely may need to issue it twice for it to have effect.
Since those two bits of advice are in effect operations rather than advice they can have further interesting complications as described here in detail and can be summarized as:
POSIX_FADV_DONTNEED removes the relevant pages from the page cache unconditionally, so if they were in use by other processes they will have to be read into the page cache again.
The solution to the latter issue proposed in the linked page is to ask the OS for the list of pages of the file already in the buffer cache, before reading or writing it, and then issue POSIX_FADV_DONTNEED only on those that don't belong in that list.
The definition of the POSIX_FADV_DONTNEED operation in particular is rather ill-advised (pun intended!). In theory the access pattern advice POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM should obviate the need for both POSIX_FADV_DONTNEED and POSIX_FADV_WILLNEED, but in practice they don't seem to work at all, or well enough, on Linux, requiring the use of lower level and awkward operations.
As previously argued, the stdio library and the kernel of a POSIX system should be able to infer which advice to use implicitly in the vast majority of cases from the opening flags and the first few file operations, as was the case in the UNIX kernel on the PDP-11 just 40 years ago...
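In the meantime, here is a hedged sketch of what one ends up doing in practice, using the nocache wrapper mentioned above and the nocache flag of GNU dd (the file names are just examples):
# copy a large file without leaving it in the page cache; the wrapper
# preloads a library that issues POSIX_FADV_DONTNEED on the files touched
$ nocache cp /srv/dumps/big.img /mnt/backup/big.img
# drop an already cached file from the page cache after the fact
# (the whole-file POSIX_FADV_DONTNEED idiom documented for GNU dd)
$ dd if=/srv/dumps/big.img iflag=nocache count=0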
One of the blog posts mentioned as to the news about a new 10TB disk drive has some comments reporting (1, 2) that a USA startup has designed a 1U rackmounted custom flash SSD unit with a capacity of over 500TB.
That's for a completely different target market: it is all about reducing cost of operation, from power to rack size, while delivering very high IOPS-per-TB. But for very rich customers it can be used for archival too.
Note: To me it looks like a fairly banal product that could be slapped together by any Taiwanese or Chinese company, but in inimitable Silicon Valley style the company that developed the product raised around $50m in venture capital funding and was valued several times that when it was sold. Yes, as one of their people in the relevant YouTube presentation says high density flash is not entirely trivial to design and make reliable, but perhaps that valuation is a bit high for that.
In interesting news the first shingled tracks disk drive has been released and has a huge 10TB capacity. There is no information on the price, but 8TB conventional drives currently cost around $724 (+ taxes) so hopefully the new 10TB drive will cost less or else it does not make a lot of sense.
Shingled tracks drives require special handling as they have large write blocks, larger than read ones, so updates require OS support for RMW. Given that an 8-10TB drive is bound to have terrible IOPS-per-TB, whether it uses shingled tracks or not, it is hopefully going to be used for archival, almost as a tape cartridge with fast random access, so the large RMW problem is going to be minimized.
There have been two trends in computing that to me seem underappreciated, and they are that access times for both DRAM cells and rotating disk have been essentially constant (or at best improving very slowly) for a long time, while capacity and transfer rate for RAM and disks have been growing a lot.
Note: the access time of a DRAM cell has barely improved for a long time, but the latency of a DRAM module has become a bit worse (in absolute time, rather worse in cycles) over time as DRAM cells get organized in ever wider channels to improve sequential transfer rates.
The situation is particularly interesting for rotating disks because access times depend on both arm movement speed and rotation speed, and while arm speed may have perhaps doubled over a few decades rotation speed has remained pretty constant at the common rates of 7200, 10000 and 15000 RPM.
The result has been that rotating storage IOPS-per-disk have been essentially constant around the 100-150 level, and this has meant that IOPS-per-TB have been falling significantly with the increase in disk capacity.
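To illustrate with a quick bit of arithmetic, assuming a constant ~120 IOPS per drive (in the 100-150 range above), IOPS-per-TB falls in direct proportion to capacity:
$ awk 'BEGIN { for (tb = 1; tb <= 8; tb *= 2) print tb "TB", 120/tb, "IOPS/TB" }'
1TB 120 IOPS/TB
2TB 60 IOPS/TB
4TB 30 IOPS/TB
8TB 15 IOPS/TB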
This often is bad news for consolidation
of computing infrastructures because often the VMs end up
sharing the same physical disk, where the original physical
systems had their own disks. On paper sharing a large capacity
disk among many VMs, where the capacity of physical system
disks was often underused, seems a significant saving, until
IOPS-per-TB limitations hit, for two reasons: the IOPS demands of all the VMs now fall on a single drive that still delivers only 100-150 IOPS, and the one large disk has a much lower IOPS-per-TB than the several smaller disks it replaces.
Things are also bad when wide parity-RAID sets are used in an
attempt to gain IOPS, because when writing those parity-RAID
sets get synchronized
by RMW involving
parity, which much reduces aggregate IOPS.
So the big question becomes how many VM virtual disks to
store on each physical disk sharing IOPS, given their access
patterns. Often the answer is not pleasant because it turns
out that consolidating disk capacity is much easier and
cheaper than consolidating
IOPS.
I have been illustrating some secondary properties of the RAID14 scheme that I mentioned some time ago, for people who object to RAID10 having the geometric property that it does not always continue to operate after the loss of 2 members, and I found that some of its properties are not so obvious, especially when comparing it to a RAID15 scheme that is also possible.
I still think that RAID10 is much preferable to RAID14, given likely statistical distributions of member losses, but the arguments above may make RAID14 preferable to RAID15 in many cases.
Some years ago I started using a
UMTS
3G
(also known as mobile broadband
) USB
modem
for Internet access from various
places, with some reception issues in some places. To make sense of these I wanted to see the signal strength and other status, but suitable status tools were mostly only available under MS-Windows. I found one GNU/Linux tool, but it was not very reliable.
So I have investigated and found out that most UMTS 3G modems
have a very similar interface to a serial line modem with an
AT
command set (once upon a time called a
Hayes command set) with UMTS 3G
extensions that are fairly well standardized.
The modem appears under Linux as three serial ports, usually ttyUSB0, ttyUSB1 and ttyUSB2. The first is used for data traffic, but usually ttyUSB2 reports status, and/or can be used to give commands to setup status reporting.
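The status port can be poked by hand with standard AT commands; a small hedged sketch (assuming the status port really is ttyUSB2, as above):
# put the port in raw mode and keep it open read/write on fd 3
$ stty -F /dev/ttyUSB2 raw -echo
$ exec 3<>/dev/ttyUSB2
# AT+CSQ is the standard 3GPP query for signal quality
$ printf 'AT+CSQ\r' >&3
$ timeout 2 cat <&3          # prints something like "+CSQ: 21,99" and "OK"
$ exec 3<&-                  # close the port again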
So I have written a small perl script (download here) to do that. It is not as polished as I would like, but it seems to work pretty well. However I eventually realized that its overall logic should be different. As it is currently it sets up the modem to report every second its status, and then parses the status messages. What it ought to do is to disable periodic reporting, and instead sleep for some interval, and then request the current status, and repeat.
Its single parameter is the serial device to use, which defaults to /dev/ttyUSB2, and typical invocation and output are:
$ perl umtsstat.pl
hh:mm:ss  RX-BPS  TX-BPS  ELAPSED    RX/TX-MIB  ATTEN.DBM  MODE    CELL
10:23:22      20       0  000:00:02  0/0        -67 (21)   WCDMA.  6C09C2
10:23:24       0       0  000:00:04  0/0        -67 (21)   WCDMA.  6C09C2
10:23:26       0       0  000:00:06  0/0        -67 (21)   WCDMA.  6C09C2
10:23:28       0       0  000:00:08  0/0        -67 (21)   WCDMA.  6C09C2
10:23:30       0       0  000:00:10  0/0        -67 (21)   WCDMA.  6C09C2
10:23:32       0       0  000:00:12  0/0        -67 (21)   WCDMA.  6C09C2
10:23:34       0       0  000:00:14  0/0        -67 (21)   WCDMA.  6C09C2
10:23:36      42      84  000:00:16  0/0        -67 (21)   WCDMA.  6C09C2
10:23:38     126      84  000:00:18  0/0        -67 (21)   WCDMA.  6C09C2
It still surprises me sometimes how high the expectations by users are of magical or even psychic behaviour by filesystems, and a recent example is the discovery by some users, as neatly summarized here, that some filesystems create rather fragmented files if applications don't give hints and then proceed to write large files slowly and in small increments.
Shock and horror, but what can the filesystem do without psychic powers enabling it to predict the final size of the file or that it will continue to be written slowly at intervals for a long time?
Some filesystem designs try to reserve some space past the current end of an open file to accommodate further appending writes, and some do so even for closed files. Some, if they have several files open for write at the same time, will try to leave room for growth by deliberately placing those files far away from each other, which has its own problems.
But all these choices have severe downsides in not so
uncommon cases. The results are familiar to those who don't
expect magic: when downloading large files over slow links
they often end up in many extents too,
and the same happens with tablespaces
of
slowly growing databases, and similarly for data collected by
instruments at intervals, and many other log-like data
patterns.
This happens because of limitations at three different levels: the application, which rarely gives hints about its access patterns or final file size; the I/O library (for example stdio), which rarely infers or passes them on; and the filesystem, which cannot guess them by itself.
In the instant case of a program like syslog, or an imitation of it, that writes relatively small increments slowly over a long time, it is up to the application writer's skill to use, where available, hints for access patterns and for file size.
I have illustrated access pattern hints previously and these ought to help a lot, along with preallocation hints (fallocate(2), or truncate(2) with a value larger than the current size of the file); but then application programmers often don't even use fsync(2), which is about data safety, and similarly don't check the return value of close(2).
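As a small hedged example of the preallocation hints just mentioned, using the corresponding command line utilities (the file names are made up):
# preallocate 1GiB of extents for a file that will grow slowly
$ fallocate -l 1G /var/log/bigapp.log
# or just extend the file to its expected final size without allocating blocks
$ truncate -s 1G /var/lib/dbms/tablespace.dat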
There is a large ongoing issue with the Linux kernel, and it is only getting worse: while wisely the basic kernel system call interface is kept very stable, there are a significant number of kernel interfaces that are not system calls and change a lot more easily, for example the /proc interface and the /sys interface or the socket operations NETLINK interface.
Note: the common aspect is that they use core kernel data interfaces for metadata or even commands.
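For example, three quite different kernel interfaces, none of which is a plain system call (eth0 is just a placeholder interface name):
$ cat /proc/sys/net/ipv4/ip_forward    # the /proc text interface
$ cat /sys/class/net/eth0/mtu          # the /sys attribute interface
$ ip -s link show dev eth0             # NETLINK sockets, via the iproute2 ip tool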
In other words, just as in Microsoft's Windows, subsystem designers design their own kernel APIs, largely voiding the aim of having a unified and stable system call interface.
To some extent this is inevitable: after all at least some critical services may well be implemented in user space, using whatever communication mechanism and API style to define special protocols. But it becomes rather different when it is kernel interfaces that are defined using arbitrary mechanisms, because the kernel system call interface then becomes largely irrelevant, as most programs end up using many other interfaces besides.
Maintaining the system call API clean and stable might have been of value when programs expected few and simple services, for example when networking and graphics were uncommon, and the kernel system call API was more or less coextensive with the stdio library.
What perhaps might have been better for the long term would have been to define a simple communication scheme, with a well established style of using it, and then used it for both process-to-kernel and process-to-process (client-to-daemon) requests.
But that course of action never proved popular, in part
because most of its proponents only wanted a
process-to-process communication scheme because they were fond
of microkernel
designs which turned out to
be less optimally feasible on general purpose
CPUs.
The result is that Linux is in effect not just a monolithic kernel with a low overhead system call API, it is an agglomeration of a number of specialized monolithic kernels with their own more or less low overhead APIs; what all these kernels share is some minimal low level infrastructure, mostly memory, interrupt and lock management.