Computing notes 2013 part one

This document contains only my personal opinions and calls of judgement, and where any comment is made as to the quality of anybody's work, the comment is an opinion, in my judgement.


130414 Sun: Using Aptitude to maintain the list of packages

Aptitude is probably the best APT front-end, in particular because it has a full-screen user interface without being a GUI, so it is easy to use over a terminal connection, and because it has a remarkably powerful query language for the package collection it manages, which includes both installed and available packages. In distributions that use DPKG it has the usual limitations of the latter, but that is inevitable.

I have recently used it to clean up a bit the list of packages on one of my systems, so as to eliminate unnecessary packages, and to ensure that packages are appropriately tagged as being automatically or manually selected.

Note: APT and Aptitude are about dependency management: they keep track of which packages depend on which other packages, or are required by them, and of which packages have been installed intentionally by the system administrator and which have been installed automatically purely to satisfy the dependencies of other packages. The latter can be automatically removed when no installed package requires them anymore.

The Aptitude queries useful for this are based on using these predicates (and their negations):

  • ~i matches installed packages, and !~i packages that are available but not installed;
  • ~M matches packages marked as automatically installed;
  • ~E matches essential packages;
  • ~g matches "garbage" packages, that is installed packages that no other installed package requires anymore.

The roots of the dependency graph are essential packages and manually installed packages. Since essential packages should always be present they should usually be marked as manually installed (even if it does not make much of a difference). To mark a package one can use the most common Aptitude commands on the line listing the package, and they are:

  • m to mark the package as manually installed;
  • M to mark the package as automatically installed;
  • _ to mark the package for purging.

Some commands help displaying package lists. Usually I prefer a rather shallow package list or the New Flat Package List instead of the deeply nested subsections Aptitude displays by default, and the commands that change the display are:

  • L to limit the display to the packages matching a query;
  • the entries of the Views menu, such as New Flat Package List, to switch to a different arrangement of the list.

Therefore the first check is to use L with the query ~E~M to list all essential packages that are marked as automatically installed, and to use m to mark them manually installed. There should also be no packages of the system's architecture in the list selected by the ~E!~i query.

The other check is to use the ~g query to look at installed packages that seem unneeded according to APT, and either purge them with _ or mark them as manually installed if they are intended to be installed. In general unneeded run-time and compile- or link-time libraries should be purged.
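The same checks can also be sketched non-interactively, since the aptitude command accepts the same search patterns on its command line (a sketch, to be run as root, and the search output should be reviewed before changing any marks):

```shell
# List essential packages wrongly marked as automatically installed,
# then mark them all as manually installed:
aptitude search '~E~M'
aptitude unmarkauto '~E~M'

# List installed packages that no longer seem needed, then purge
# those that really are unneeded:
aptitude search '~g'
aptitude purge '~g'
```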

It may then be useful to review the manual/automatic installed mark for several other categories of packages.

The overall goal is to ensure that the state of installed packages is well described, so that for example ~g lists, without perplexity, exactly the packages that can be purged, and nothing else.

130409 Tue: SQL SELECT styles and join indentation

My introduction to DBMSes was the original Ingres on a PDP-11 and its Quel query language, which was based on the relational calculus and was quite nice, so I have largely continued to think in those terms even when events decreed that most DBMSes had to be queried in SQL.

SQL allows queries that resemble relational calculus, for example this one against a system inventory database:

SELECT
  n_d.dept	AS "Dept.",
  l_u.user      AS "User",
  l.name        AS "List",
  n.snum        AS "Serial",
  n.name        AS "Node"
FROM
  nodes		AS n,
  lists         AS l,
  nodes_depts   AS n_d,
  nodes_lists   AS n_l,
  lists_users   AS l_u
WHERE n.snum = n_l.snum
  AND n_l.list = l.list
  AND n_l.list = l_u.list
  AND n.snum = n_d.snum
  AND (n_d.dept = ANY ('sales','factory')
       OR l_u.user = 'acct042')
ORDER BY 
  n_d.dept,l_u.user,n.snum,l.name

In the above the SELECT and ORDER BY clauses describe the output table, FROM lists the relations over which the calculus will operate (similar to the range of declarations in Quel), and the WHERE defines which tuples will end up in which rows of the output table.

Note: I use table to indicate the result, as the result is not a relation: it has an ordering of its rows, while the tuples in a relation are not ordered.

The DBMS is required in the above to figure out the various constraints, and in particular that some equality predicates among different tables imply join rather than selection operations.

An important detail in the above query is that equality constraints for joins and equality selections look syntactically identical, while perhaps it would have been better to indicate equality constraints for joins with a different symbol, for example ==.

Anyhow SQL is based more on relational algebra, and while the above calculus-style query is possible, standard SQL-92 does not have a disjunctive equality predicate to indicate outer joins, like for example the Sybase *= that indicates a left outer join. Outer joins need to be written in a syntax that makes clear that SQL-92 is based on relational algebra, in this fashion (omitting the SELECT and ORDER BY clauses, which remain the same):

FROM
  nodes				AS n 
    LEFT JOIN nodes_lists         AS n_l
    ON n.snum = n_l.snum
      LEFT JOIN lists               AS l 
      ON n_l.list = l.list
      LEFT JOIN lists_users         AS l_u
      ON n_l.list = l_u.list
    LEFT JOIN nodes_depts         AS n_d
    ON n.snum = n_d.snum
WHERE
  n_d.dept = ANY ('sales','factory')
  OR l_u.user = 'acct042'

In a query styled as above all joins must be indicated, almost operationally, in the FROM clause, and the WHERE clause describes only selection predicates.

Therefore the FROM clause no longer just declares the relations over which the query is defined, but describes the algebra that builds a joined relation starting from a seed relation such as nodes, with WHERE filtering the resulting relation by tuple selection, and SELECT projecting it onto a subset of its fields.

As such I have come, regretfully, to think that this is the best way to write queries within the SQL-92 design. I would prefer a calculus-based query language, but SQL-92 and its variants are not really that, and using SQL calculus-like syntax may be a bit misleading, in particular as only some join types may be written in calculus-like form. Some variants of SQL like Sybase's and Oracle's have calculus-like outer join predicates, and the Sybase one is fairly reasonable (the Oracle one is quite tasteless), but even those fundamentally go against the grain of SQL-92.

In this vein I have indented the FROM clause above to show the logical structure of the algebraic operations involved (as Edsger Dijkstra suggested such a long time ago), as the SQL-92 join algebra does not allow parentheses to imply grouping. So here each level of nesting describes a relation to be joined with the next level of nesting.

The indentation aims to suggest that the nodes relation (indentation level 1) is to be joined with nodes_depts and with nodes_lists, which implicitly defines a new relation (indentation level 2) that is to be joined with the lists and lists_users relations.

SQL-92 allows going the whole way round and adding tuple selection predicates to the tuple joining predicates in the FROM clause, in a fashion like this:

FROM
  nodes				AS n 
    LEFT JOIN nodes_lists         AS n_l
    ON n.snum = n_l.snum
      LEFT JOIN lists               AS l 
      ON n_l.list = l.list
      LEFT JOIN lists_users         AS l_u
      ON n_l.list = l_u.list
    LEFT JOIN nodes_depts         AS n_d
    ON n.snum = n_d.snum
      AND (n_d.dept = ANY ('sales','factory')
 	   OR l_u.user = 'acct042')

But I regard this as bad taste, and I am not even sure that the example above is valid, because of the position of the OR l_u.user = 'acct042' condition: where it is in the example is probably the wrong context, and moving it makes the FROM clause take a rather different meaning to me:

FROM
  nodes				AS n 
    LEFT JOIN nodes_lists         AS n_l
    ON n.snum = n_l.snum
      LEFT JOIN lists               AS l 
      ON n_l.list = l.list
      LEFT JOIN lists_users         AS l_u
      ON n_l.list = l_u.list
	OR l_u.user = 'acct042'
    LEFT JOIN nodes_depts         AS n_d
    ON n.snum = n_d.snum
      AND n_d.dept = ANY ('sales','factory')
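The difference is concrete, not just stylistic: with an outer join, moving a selection predicate between the WHERE clause and an ON clause changes the result. A minimal sketch with SQLite and hypothetical toy data (two tables named after the ones above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (snum INTEGER, name TEXT);
CREATE TABLE nodes_depts (snum INTEGER, dept TEXT);
INSERT INTO nodes VALUES (1, 'alpha'), (2, 'beta');
INSERT INTO nodes_depts VALUES (1, 'sales');
""")

# Selection predicate in WHERE: the null-extended rows produced by
# the outer join are filtered out, so 'beta' disappears.
in_where = con.execute("""
    SELECT n.name, n_d.dept
    FROM nodes AS n LEFT JOIN nodes_depts AS n_d ON n.snum = n_d.snum
    WHERE n_d.dept = 'sales'
    ORDER BY n.snum
""").fetchall()

# Same predicate in ON: it only restricts which tuples join, so
# 'beta' survives with a NULL dept.
in_on = con.execute("""
    SELECT n.name, n_d.dept
    FROM nodes AS n LEFT JOIN nodes_depts AS n_d
      ON n.snum = n_d.snum AND n_d.dept = 'sales'
    ORDER BY n.snum
""").fetchall()

assert in_where == [('alpha', 'sales')]
assert in_on == [('alpha', 'sales'), ('beta', None)]
```

In the WHERE form the predicate discards exactly the rows the outer join was supposed to preserve; in the ON form it only limits the matching, which is why mixing selection predicates into the FROM clause so easily changes the meaning of the query.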

In any case I think that joining and selecting predicates are best separated for clarity, as they have rather different imports.

On trying similar queries on some relational DBMSes with an EXPLAIN operator, all three forms are analyzed in exactly the same way, as they are fundamentally equivalent, not least because of the equivalence of relational calculus and algebra.
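For plain inner joins the equivalence is easy to check; a minimal sketch with SQLite and hypothetical toy data, using the standard IN operator in place of the non-standard = ANY:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (snum INTEGER, name TEXT);
CREATE TABLE nodes_depts (snum INTEGER, dept TEXT);
INSERT INTO nodes VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
INSERT INTO nodes_depts VALUES (1, 'sales'), (2, 'factory'), (3, 'hr');
""")

# Calculus style: the join is an equality predicate in WHERE.
calculus = con.execute("""
    SELECT n.name, n_d.dept
    FROM nodes AS n, nodes_depts AS n_d
    WHERE n.snum = n_d.snum AND n_d.dept IN ('sales', 'factory')
    ORDER BY n.snum
""").fetchall()

# Algebra style: the join is built operationally in FROM.
algebra = con.execute("""
    SELECT n.name, n_d.dept
    FROM nodes AS n JOIN nodes_depts AS n_d ON n.snum = n_d.snum
    WHERE n_d.dept IN ('sales', 'factory')
    ORDER BY n.snum
""").fetchall()

assert calculus == algebra == [('alpha', 'sales'), ('beta', 'factory')]
```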

130405 Fri: Another significant UNIX platform flaw: low use of archives

The lack of distinct effective and real owners for inodes in UNIX (and derivatives) is perhaps its major flaw, but time has revealed some other significant flaws.

To be clear I am talking here of flaws in the design of UNIX qua UNIX, that is within its design style, not of the flaws of the UNIX design style itself.

Among these the first is the lack of a generally used library providing structured files as a commonly used entity; by this I mean something like ar files. libarchive has appeared relatively recently, but too late.

Unfortunately structured archive files and easy access to their member files is a quite common requirement, especially in all those cases where data accumulates in a log-like way, which are many.

While in the UNIX design libraries of object (.o) files are archived in ar files (.a), that is, almost incomprehensibly, nearly the only common live use of archives.

The only other common use of archives, mbox style mail collections, is incomplete, and it shows both the consequences of not using archives and the main reason, besides the lack of a ready-made library, why they are not commonly used.

Note: while ZIP archives also exist, they are not commonly used in the UNIX/Linux culture, and tar archives lack many of the conveniences of general archive formats, being mostly used as a kind of hibernation format for filetrees.

In effect an mbox file is an archive of mail messages, and it is mostly handled with similar logic: mbox archives are mostly read-only, they are mostly appended to, and deleting a member is usually handled by marking the member deleted and then periodically rewriting the archive minus the deleted members.

But there is no convenient command like ar to manipulate mbox files as archives, even if most mail clients allow manipulating them interactively in some limited fashion.
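What does exist is library support in some languages; for example Python's standard mailbox module handles mbox files with exactly the archive-like logic described above (a sketch using a hypothetical temporary mailbox):

```python
import email.message
import mailbox
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "inbox.mbox")
mb = mailbox.mbox(path)  # created on first use

msg = email.message.Message()
msg["From"] = "alice@example.org"
msg["Subject"] = "hello"
msg.set_payload("first message\n")

key = mb.add(msg)   # appending, the common mbox operation
mb.flush()
assert len(mb) == 1

mb.remove(key)      # deletion takes effect only when flush()...
mb.flush()          # ...rewrites the whole archive minus the member
assert len(mb) == 0
mb.close()
```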

The mbox archive anyhow has long been on the wane, being increasingly replaced by MH/Maildir style filetrees, and that shows one of the reasons why a standard archive library was not written early or widely adopted: it is always possible to replace an efficient archive with a directory linking to the individual members. This scales very badly, but most software projects, whether commercial or volunteer, live or die by a small scale initial demo, so abusing directories as containers or as simple databases is an appealing option, one however that is rarely revisited later.

Indeed the database case invites skepticism as to whether the early availability of an archive handling library would have helped: BDB and similar libraries have been available for a long time, and yet many applications use directories as keyed containers of small records.

But considering the case of the stdio library, which is widely used for buffered IO, I think that if something similar had existed for ar style archives they would have been far more widely used, in the numerous cases where there are mostly read-only collections of small files, for example .h and similar header files, or .man manual page files.
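As an illustration of how simple such a library could be, here is a minimal sketch of a writer and reader for the classic ar format (my own toy code: it handles only member names up to 16 characters and none of the real format's edge cases such as long names or symbol tables):

```python
import io

AR_MAGIC = b"!<arch>\n"

def ar_pack(members):
    """Pack a {name: bytes} mapping into a classic ar archive."""
    out = io.BytesIO()
    out.write(AR_MAGIC)
    for name, data in members.items():
        # 60-byte member header: name(16) mtime(12) uid(6) gid(6)
        # mode(8) size(10), terminated by "`\n".
        hdr = (name.ljust(16) + "0".ljust(12) + "0".ljust(6)
               + "0".ljust(6) + "100644".ljust(8)
               + str(len(data)).ljust(10)).encode("ascii") + b"`\n"
        out.write(hdr)
        out.write(data)
        if len(data) % 2:       # member data is 2-byte aligned
            out.write(b"\n")
    return out.getvalue()

def ar_unpack(blob):
    """Inverse of ar_pack, returning {name: bytes}."""
    assert blob[:8] == AR_MAGIC
    members, off = {}, 8
    while off < len(blob):
        hdr = blob[off:off + 60]
        name = hdr[:16].decode("ascii").rstrip()
        size = int(hdr[48:58].decode("ascii"))
        members[name] = blob[off + 60:off + 60 + size]
        off += 60 + size + (size % 2)
    return members

files = {"hello.h": b"#define HI 1\n", "odd.txt": b"abc"}
assert ar_unpack(ar_pack(files)) == files
```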

130404 Thu: Nagios or Icinga, a better configuration style

Various recent talks about monitoring systems, and in particular Nagios and Icinga, have reminded me of a topic that I should have posted about here a long time ago.

Nagios and Icinga have some very nice ways to design a configuration, among them orthogonal inheritance among object definitions and dependencies among them, but also relationships between host objects and service objects and between each of these and their groups.

To indicate such relationships there are two styles: one is rather brittle and very commonly used, the other is rather elegant and flexible, but amazingly rarely used.

The first way looks like this (I haven't actually tested any of these sketched example configurations, and some parts are omitted):

define service {
use MainStreet-templsvc
check_command check_DHCP_MainStreet
host_name SV01
service_name DHCP-service-MainStreet
service_description service for host configuration
}

define host {
description server for DHCP DNS in Main Street DC
use generic-MainStreetServer
host_name SV01
parents GW01_gateway
parents routerGW02
alias server1.example.org
}

define host {
use generic-MainStreetServer
parents GW01_gateway
host_name server-2
alias server02.example.org
description backup server for DNS in Main Street DC
parents routerGW02
}

define host {
use generic-MainStreetServer
host_name Server03
parents GW01_gateway
alias S03.example.org
description server for IPP in Main Street DC
parents router-GW02
}

define hostgroup {
hostgroup_name mainst_servers
members SV01,server-2,Server03
alias Main St. Servers
}

define service { service_description service for name resolution
		 use MainStreet-templsvc
		 service_name DNS_Mainst-resolvSV01
		 host_name SV01
		 check_command check_DNSr_MainStreet }

define service {	service_name resolver_server2_DNS_Mainst
		 	host_name server-2
	  		use MainStreet-templsvc
			service_description service for name resolution
			check_command check_DNSr_MainStreet}

define service {
host_name server-SV03
name service-CUPS-MainStreet
service_description service for printing
check_command check_IPP_MainStreet
use service-MainStreet
}

The above is a typical messy configuration file which has several weaknesses.

The first and major issue is that the list of hosts running a service is explicit in the service object definition. It means that adding, renaming or removing a host requires finding and editing the definition of every service running on it, and that the host and service definitions easily drift apart.

The alternative is not to list the services to be checked on a host in the host object definition, because that cannot be done; it is something even better: to define a matching hostgroup object per service, and to specify its name in the service object definition with hostgroup_name.

The second and also fairly significant problem is similar: in the hostgroup object definition there is an explicit list of the hosts that belong to the group, with much the same issues. The alternative is to specify a skeleton hostgroup definition and to add its name to the hostgroups list in the host object definition of each member of the hostgroup.

Another significant issue is that there are per-host service object definitions, which does not reflect that the same service may be running on many hosts.

The alternative is to create service definitions that are host-independent, and rely on parameterization to adapt each service check to the host it runs on, as far as possible.

Another significant issue is that the names and aliases of objects do not conform to a systematic naming convention, which means that wildcards cannot be used to select sets of related objects, and that it is hard to tell at a glance what kind of object a name refers to.

The alternative is systematic naming conventions for objects, for example that all host object names for servers should begin with server-, so that the wildcard server-* indicates all of them.

Finally, the order, indentation and layout of the object definitions is unsystematic too, and that makes them much harder to compare and maintain.

A different configuration would look like:

define hostgroup {
  hostgroup_name	datacenter-MainStreet
  alias			datacenter "MainStreet" hosts
}

define hostgroup {
  hostgroup_name	network-MainStreet-A
  alias			network "MainStreet" backbone A hosts
}
define hostgroup {
  hostgroup_name	network-MainStreet-B
  alias			network "MainStreet" backbone B hosts
}

define hostgroup {
  hostgroup_name	subnet-192-168-020_24
  alias			network subnet 192.168.20/24 hosts
}
define hostgroup {
  hostgroup_name	subnet-192-168-133_24
  alias			network subnet 192.168.133/24 hosts
}

define hostgroup {
  hostgroup_name	service-DNS
  alias			service DNS hosts
}
define hostgroup {
  hostgroup_name	service-DHCP
  alias			service DHCP hosts
}
define hostgroup {
  hostgroup_name	service-IPP
  alias			service IPP hosts
}

define service {
  use			service-generic-IP
  hostgroups		service-DHCP
  service_name		service-DHCP
  service_description	serve network configurations

  check_command		check_dhcp!-u -s $HOSTNAME$
}
define service {
  use			service-generic-IP
  hostgroups		service-DNS
  service_name		service-DNSr
  service_description	serve DNS resolution

  check_command		check_dns!-H icinga.org -s $HOSTNAME$
}
define service {
  use			service-generic-IP
  hostgroups		service-IPP
  service_name		service-IPP
  service_description	serve printer spooling

  check_command		check_ipp
}

define host {
  use			server-generic
  host_name		server-generic-MainStreet-AB
  description		server in Main Street DC on both backbones

  hostgroups		+ datacenter-MainStreet
  hostgroups		+ network-MainStreet-A,network-MainStreet-B
  hostgroups		+ subnet-192-168-020_24,subnet-192-168-133_24

  parents		gateway-01
  parents		gateway-02

  register		0
}

define host {
  use			server-generic-MainStreet-AB
  host_name		server-01
  alias			server1.example.org
  description		server for DHCP DNS in Main Street DC

  hostgroups		+ service-DHCP,service-DNS
}

define host {
  use			server-generic-MainStreet-AB
  host_name		server-02
  alias			server02.example.org
  description		server for DNS in Main Street DC

  hostgroups		+ service-DNS
}

define host {
  use			server-generic-MainStreet-AB
  host_name		server-03
  alias			S03.example.org
  description		server for IPP in Main Street DC

  hostgroups		+ service-IPP
}

Please note all the details in the above. One that I did not mention before is the definition of various hostgroups not associated with services, as is normal, usually for physical location, power circuit, cooling circuit, physical network connection, subnets, type of host (router, switch, server, desktop, ...), organization. Their main purpose is to spot at a glance whether the lack of availability of a service or server is shared with others in the same hostgroup, for example because of a failed power circuit. Such hostgroups may also be associated with services if a check is possible, for example for power availability.

I typically write distinct configuration files per collection of hosts, where each collection shares hostgroups, some service or host templates, some services, some routers, some switches, some servers, some desktops, usually in that order.

A well written, legible Nagios or Icinga configuration in the style above is both fairly easy to maintain and very good documentation of an installation, and because of that it also makes finding the extent and causes of episodes of unavailability much easier.

130329 Fri: Effective and real user ids for files

Over many years of chatting with various people about the failings of various aspects of the UNIX design decisions, it has emerged that some were failings not so much of the architecture as of how it was implemented, or incompletely implemented.

The biggest probably is the lack of distinction between effective and real ownership of files.

While UNIX processes have both an effective and a real owner, UNIX files only have one owner, which is a combination of the two.

This for example makes it very awkward to transfer ownership of a file from one owner to another. Obviously the target user cannot be allowed to take ownership on their own initiative.

But less obviously the source user cannot be allowed to give ownership of a file to a target user either, because that would allow them to retain all the advantages of file ownership without the disadvantages (such as space accounting): the file can be linked into a directory accessible only by the source user, and then given wide permissions for everybody else.

The obvious solution is to ensure that effective and real ownership be separate, with effective ownership being about access permissions, and real ownership about accounting aspects, with the rule that the real owner can change the effective owner and the permissions, and the effective owner can access the file and change the real owner to themselves.

In this way the creator of a file can first change the permissions to none and the effective owner to the target user, to initiate the transfer of ownership, and the target user can then accept the transfer by changing the real owner to themselves and then the permissions so as to access the file. If the target user does not accept the transfer, the source user can always change the effective ownership of the file back to themselves.

There is a question whether there should be two sets of permissions for effective and real owners, but I think that only one is necessary and desirable.
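The proposed rules can be modelled as a toy sketch (entirely hypothetical, since no UNIX implements this; permission checking on access is elided) showing the two half steps of a transfer from alice to bob:

```python
# Toy model of the proposed separate effective/real file owners:
# the real owner controls permissions and may give the effective
# ownership away; the effective owner may accept real ownership.

class File:
    def __init__(self, owner):
        self.effective = owner   # governs access permissions
        self.real = owner        # governs accounting (quota etc.)
        self.mode = 0o600

    def chown_effective(self, caller, new_owner):
        assert caller == self.real, "only the real owner gives access away"
        self.effective = new_owner

    def take_real(self, caller):
        assert caller == self.effective, "only the effective owner accepts"
        self.real = caller

f = File("alice")
f.mode = 0                          # step 1: alice withdraws access...
f.chown_effective("alice", "bob")   # ...and offers the file to bob

f.take_real("bob")                  # step 2: bob accepts the accounting
f.mode = 0o600                      # ...and restores access for himself
assert (f.effective, f.real) == ("bob", "bob")
```

Until bob performs the second half step, alice (still the real owner) can call chown_effective again to withdraw the offer, exactly as described above.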

If it were just for transfer of ownership the above would be an important but non-critical internal limitation of the UNIX architecture, but its importance is far greater than that, because the distinction would make the set-uid feature of UNIX far more useful and easier to use.

Set-uid appears in two manifestations, one related to executable files, the other related to the ability of processes with an effective user of root to switch effective ownership to another user and back.

The essential aspect of set-uid for executable files is that it allows for programmatic transfer of data across different users: the target user sets up a set-uid executable file which, when invoked by a process belonging to the source user, changes the effective ownership of that process to the target user. Thus transfer of data can be done as:

  1. The source user writes and executes a program that opens the files containing the relevant data. Since the effective owner of the process running that executable is the source user it has permission to open files accessible to the source owner.
  2. The target user writes and makes available a program that is set-uid, and the source user's process execs it; the process thus acquires the ability to access files accessible to the target user, but only under the control of code written by the target user, thus in a way that is safe for that target user.

The above classic scenario is typical, for example where the target user is a system dæmon or other subsystem, and the source user any other user.

It has several drawbacks, the most important being that it depends utterly on very carefully written logic in the source and target programs, as it relies on keeping exactly the right file descriptors open in the same process when switching from the source user's executable to the target user's executable.

Having distinct effective and real owners for a file instead allows for a much simpler way to transfer data: put the data in a suitable file, and transfer the ownership of the file.

Consider for example a print spooler running as user lpd and the request by user example to print a file: the lpr command could simply, running set-uid with effective user lpd and real user example, change the effective ownership of the file to lpd (if licit), and then move or copy it to the print spool directory, with the added benefit that all the files to be printed would still have as real owner the source user, and thus be accounted to them.

But there is a much bigger scope for simplification and greater security in something related to set-uid: the ability of processes with effective owner root to switch their effective ownership, temporarily or definitively, to another user, which because of various subtleties has involved complicated rules and security risks even worse than the executable-file related notion of set-uid.

The root system-call version of set-uid is typically used to do some operation between a source user and a target user mediated by a process owned by root, one that therefore needs to access resources first as the source user and then as the target user.

This could be solved by having a set-uid executable for every possible target user, but that is rather impractical in most cases; it is deemed more practical to have a single executable set-uid to root for getting data from the source user, and to let processes with a real owner of root switch effective owner to any target user, with the following sequence of operations for a transfer from source user to target user:

  1. The source user forks a process that opens one of their own files and then execs an executable file set-uid to root.
  2. The process now runs a root trusted executable with effective owner root and real owner the source user, and changes the effective owner to the target user.
  3. Under the control of a root-trusted executable the process opens a file belonging to the target user, copies the data from the file descriptor opened in the first step to it, and then terminates.

This involves crossing 3 protection boundaries, and even more riskily it can happen not merely with an executable set-uid to root but with a dæmon started and running as root. For a practical example imagine the sending of messages from one user to another via a spooled mail system, where the message file is created by the source user, is held in a spool file, and then is appended to a mailbox file owned by the target user.

With the ability to distinguish between effective and real ownership of files there is no need to have root itself as an implicitly trusted intermediary; all that needs to happen is to have an explicitly trusted one, as in:

  1. The source user changes the permissions of the source file to none and its effective owner to the intermediary user.
  2. A process running an executable set-uid to the intermediary user takes real ownership of the source file, copies or moves its data to a new file, and sets the effective owner of that file to the target user.
  3. The target user takes real ownership of the new file and changes its permissions so as to access it.

Note that at no point is there any need for root as an all-powerful intermediary, nor is there any need for the source or target users to trust the intermediary completely, because the intermediary user is effective only thanks to the set-uid executables, and the source and target files are the only files belonging to the source and target users that are exposed to the intermediary, by having their effective owner set to the intermediary user.

In the UNIX tradition there has been a very partial realization of the above logic, where the role of the intermediary owning user was simulated by using an intermediary owning group instead, that is by using the owning group as a simulated effective user; this resulted in the common practice of creating, for that purpose, a unique group for every user with the same name as the user. But in the UNIX architecture owning groups do not have enough power, and regardless using groups that way is a distortion of a different concept.

In the above I have not presented all the details, as they should be quite natural; for example where I have used user that implies user or a process owned by the user, owner might be not just the owning user but also the owning group (but I am skeptical this is needed or useful), and where I have written file I have also implied file or directory or other resource.

Also some of the transition and other rules can be tweaked one way or another, and as mentioned there could be different permissions for the effective and real owners.

But the overall concept is to generalize to files the notions of effective and real owner for processes and the set-uid mechanism to transition between them. In other words it is a powerful idea to model any crossing of protection boundaries as two half steps, one initiated by the source protection domain, the second initiated by the target protection domain, such that in the first permissions but not accountability are transferred by the source to the target, and in the second accountability is transferred by the target from the source, completing the crossing.

130328 Thu: 'collectd', web UIs and 'kcollectd'

While in general I like Ganglia and other performance monitoring systems, a strong alternative to its gmond agent, at least for UNIX-style systems, is collectd.

I was looking at the list of front-ends for it, to display graphically the information it collects, and it was depressing to see that most are web UIs, many of them based on PHP, which I dislike.

But fortunately there is a GUI program for that, kcollectd, which is quite good and compact, does not require running a web server (even if only for local users) and a browser, is very responsive graphically, and is also well designed. This confirms my preference for traditional GUI interfaces over web UIs. Also, since there is little in kcollectd that is specific to collectd, as it just displays RRD files, it can probably be used with most other performance monitoring systems that write RRD files.

Anyhow if one wants to set up performance monitoring quickly on a single local system, for example a desktop or a laptop, the combination of collectd and kcollectd is very convenient and informative and requires minimal work, in part because the default collectd configuration is useful.

130327 Wed: Book: "Monitoring with Ganglia"

At various times I have used the Ganglia monitoring system, most notably in relation to some WLCG cluster.

Ganglia is quite a good system and I haven't paid it much attention, as for typical small to middling setups it sort of just works; its only regrettable feature is that it has a web UI, and one written in PHP.

To have a nice printed reference I have bought Monitoring with Ganglia O'Reilly 2012-11 by Matt Massie, Bernard Li, Brad Nicholes, Vladimir Vuksan, Robert Alexander, Jeff Buchbinder, Frederiko Costa, Alex Dean, Dave Josephsen, Peter Phaal, Daniel Pocock.

This book documents, rather comprehensively and even well written, the much more that can be done with Ganglia, and it is an excellent resource for understanding how it is designed and how it can be configured and deployed.

The final section on deployment examples is particularly useful not just on how it can be deployed, but also on how best it can be deployed.

There are some niggles, one of which is that the book is really an anthology of distinct articles on various aspects of Ganglia, even if the authorship of the articles sometimes overlaps. The overall plan of the anthology however is cohesive, so the articles flow into each other somewhat logically.

The main niggle however relates to what Ganglia is and how it is described, and this is reflected in the anthology structure of the book.

Initially Ganglia is described as a very scalable monitoring system, good for system populations of tens of thousands to hundreds of thousands, which is all agreeable, supposedly thanks to its default mode of operation with highly parallel multicasting data collection.

The default mode of operation is to run the gmond dæmon on each system, which both collects data from that system and aggregates data from every other system in its cluster; this is achieved by multicasting the measurements collected by each system to all other systems in the cluster. These are then condensed into an overall state by the gmetad dæmon, which is fetched and viewed via the PHP-based web UI.

Obviously this is not very scalable, even if it is very resilient: each system has a full record of the measurements for every other system in the cluster, but this very replication and the traffic that realizes it make for less scalability.

Scalability is achieved by two completely different means, as subsequent articles illustrate very clearly:

  • partitioning the population of systems into clusters of limited size, so that the full replication of measurements happens only within each cluster (optionally with unicast to a few designated aggregator nodes instead of multicast to all);
  • federating the clusters through a hierarchy of gmetad dæmons, each of which condenses the state of the clusters or subtrees below it.

Thus the main, and not very large, defect of the book is the confusion likely to arise in the reader when Ganglia is described as being scalable for some reasons at the beginning, and then after a while those reasons are discounted and replaced by much better ones.

Overall it is a very good, useful, well written book with excellent setup advice.

130321 Thu: FLOSS UK Spring 2013 day two

My notes from the second day of presentations at the FLOSS UK Spring 2013 event:

PostgreSQL update:
  • Postgres has been 20 years in development, 1M lines of code, allegedly 10 times fewer bugs than MySQL, and 10 times fewer than Oracle (per line of code).
  • 400 patches per release, 8-10 FTEs over 50 contributors.
  • Debian reports 10k databases.
  • Heroku has 1.5m AWS servers with PostgreSQL.
  • Users report it has feature parity with MySQL and much better indexing and query planner.
  • 9.2 is a better cloud database. Faster because it has better code; group commits are much faster on slow-IOPS disks; reduced power consumption thanks to a quick switch into sleep mode, without affecting performance. Also cascading replication reduces bandwidth to 1/2 or even 1/3. Also improved security.
  • 9.2 is also a better enterprise database, 3 times better performance for 32 way writes and 80 way reads.
  • 9.2 also has foreign tables which are views on a foreign database. This includes query optimization.
  • 9.2 also has somewhat better sysadm including better backup.
  • 9.2 is also a better DBMS for developers. For example better in-database procedures. Range types. Better SQL analysis and optimization, including spatial index access. Location queries take 1ms versus 10-15ms previously.
  • 9.2 also has index-only scans, single rows.
  • 9.2 complete table replacement as atomic operations.
  • Postgres also supports NoSQL modes of operation, like location and key-value store (plus full text indexing), and is faster at that than MongoDB.
  • 9.3 has event triggers, better group commit, config dirs, 64b objects, in-database dæmons, COPY FREEZE.
  • Also 9.3 faster failover, lock reductions, materialized views, better JSON, writable foreign tables.
  • 9.3 maybe transparent reindexing, checksums, type mapping using TRANSFORMs, parallel dumping.
  • In the future: distributed multimaster. 20 nodes already tested. Implements logical replication, which is not statement replication; it is mostly transaction log replication, 4-8 times faster than trigger-based statement replication (like Slony or Londiste). Normal locks would be slow, so it uses optimistic locking. Should be signed off in June, with production level testing by the end of the year.
  • Bi-directional replication will include filtering. Not yet automatic sharding, but manual sharding already available. With global sequencing of transactions.
  • How does multimaster compare to Cassandra? Fairly similar.
Monitoring at a hosting company:
  • Initially something as simple as Pingdom.com
  • Then decided to do it better with Net-SNMP RRD, Ganglia.
  • New hire did a Nagios and Munin config, manually. Eventually Cacti, which makes it much easier to keep things updated.
  • We ended up monitoring a lot of things that did not matter, and without a policy for alerts.
  • Munin forks a lot, still a 5 minute RRD quantum, and still monitoring pointless things like NTP clock skew. Plus crazy fsync storms every 5 minutes.
  • New hire horrified by our lack of automation, so introduced Puppet and Django. It used to take 2 days to set up a machine. Initially for low value customers, but it was better than manual builds. 150 hosts, 3 people.
  • Big issue with shared machines, because lack of separation among Apache2 instances, and stupid practices like 0777 directories. Switched to dedicated machines.
  • Switched to Icinga to get JSON.
  • Also switched to zero tolerance of criticals.
  • Vital detail: acknowledging for a period of time.
  • Problem with Munin during performance spikes, as its forking spikes too, and it probes only every 5 minutes. So switched to collectd, which is much faster, probes much more frequently, and does a target rate of fsync. Amazing difference in quality of data collected with a 10s interval.
  • At this point it is Icinga that becomes a fork bomb; also, why collect data for both Cacti and Icinga? So used a queue, etc.
  • Completely ripped out because too many problems signaled by a 10s interval. 5m was good to ignore small transients.
  • Soft issues: people burning out because of too many criticals, and so defined a policy, which was less onerous than expected.
  • Currently using Graphite, feeding to RRD via collectd and statd.
  • Summary: monitor what matters, no notification without fixing it, discuss actual requirements, share monitoring with app developers.
  • Ganglia almost as good as collectd, both enormously faster than Cacti and Munin.
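As a sketch of what the collectd side of such a setup might look like (paths and values are hypothetical, but the options are standard collectd ones):

```
# collectd.conf fragment -- 10 second probes instead of a 5 minute quantum
Interval 10
LoadPlugin cpu
LoadPlugin load
LoadPlugin rrdtool
<Plugin rrdtool>
  DataDir "/var/lib/collectd/rrd"
  CacheTimeout 120   # batch RRD updates to limit fsync storms
  CacheFlush   900
</Plugin>
```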
LDAP based office suite:
  • Components: EL (CentOS), OpenLDAP, Alfresco, MySQL, Zarafa, SugarCRM, Sendmail.
  • Setup: everything starts with OpenLDAP, top admin, and group for users.
  • Given that LDAP is used by Zarafa, SugarCRM, etc. and can be edited with phpLDAPAdmin, each needs larger memory setup than default.
  • Instead of Sendmail you can use Postfix in which case Zarafa integration is more straightforward.
  • The setup uses LDAP for Sendmail too.
  • As a backend MySQL is recommended because Zarafa and SugarCRM can only use that, so might as well use it for everything.
  • Zarafa by default uses MySQL for user register, so it has to be reconfigured to use LDAP. A Procmail recipe has to be used to deliver email to Zarafa.
  • Alfresco needs Tomcat and a number of additional modules, plus a number of different setup steps. It is vital to explicitly set logging, as the default is not suitable.
  • LDAP user entries need some extra lines for Zarafa and Samba.
  • Alternative pay-for Outlook connector. Sure it is expensive, but it is less expensive than MS-Exchange, and it is part of a FLOSS suite.
  • What about restricting login to SugarCRM to a specific LDAP group? It can do it, but it is limited to one group.
  • Scalability: depends on how big email is. One case is 600 users with small mailboxes, and a single server, another is 200 users over 4 servers because they have huge mailboxes.
  • Kerberos for authentication: yes, I would use that.
OpenNebula:
  • NETWAYS.DE is a company that does open source management.
  • Organized Open Source Data Center and Puppet Camp conferences.
  • Evaluating for flexibility and datacenter virtualization. AWS too primitive, vCloud very virtualized but not very flexible, Eucalyptus a bit flexible and a bit virtualized but with amazing restrictions.
  • OpenNebula is a European project. Production ready.
  • It does monitoring and scheduling of VMs, manages images from which to create VMs, and handles storage, users, and network configuration.
  • It also supports EC2 tools, and a command line interface, and an API based on XML-RPC and OCA.
  • State and accounts in some SQLite or MySQL database. Monitoring both internal and using Ganglia, or Icinga.
  • Resources can be setup as multiple levels of groups.
  • Really important for libvirt to be clustered as OpenNebula will restart everything if it cannot get VM status via it, even if the VMs are fine.
  • In older versions LDAP passwords are in cleartext.
  • Really nice CLI operation.
  • Demo of migrating a live service machine. Another demo of creating 6 VMs from a GRML liveCD.
  • The web UI is called Sunstone, and new version 4.0 has the same functionality but rather nicer. Even support for the Ceph file-system.
  • OpenNebula is a complete solution rather than a framework like OpenStack.
  • OpenNebula can be stopped and started at any time. When it restarts it just uses libvirt to poll hosts and figure out what the current state is.
  • OpenNebula allows configuration of MAC and IP addresses, but by default autoconfigures them.
Open source DNS servers:
  • We are not discussing djbdns as it is not maintained. Don't use it.
  • /etc/hosts is the first, then dnsmasq that serves /etc/hosts, and caching resolver.
  • MaraDNS and Deadwood return both A and PTR RRs, good for small situations, also for MS-Windows.
  • unbound is really nice, extensible with Python, DNSSEC well supported, very fast. Interesting dnssec-trigger mode for workstations.
  • NSD version 3, authoritative only, zone files are compiled, DNSSEC (cannot sign), very fast, root server. NSD version 4 imminent, slightly faster, dynamic zone adding and deleting. It has also RRL. Supports incremental zone transfers.
  • BIND version 9 commonest complaint is configuration is a bit annoying. It implements everything, including a lot of stuff that you don't want. DDNS, DBMS backed zones, RAM hungry, long restart times for 100k zones. Supports views, of questionable validity.
  • BIND version 10, completely different. REST API, DHCP, C++ or Python 3, but very annoying configuration. It does not do recursive service for now.
  • Knot was created by the .cz registry. Authoritative only, supports incremental zone transfers. Does RRL, not yet but soon dynamic zone adding.
  • Yadifa is authoritative server, done by .eu people, to provide diversity.
  • PowerDNS lots of backends, very good DNSSEC, database replication via DBMS synchronization. Version 3 is particularly nice, please upgrade from 2.9.
  • PowerDNS Recursor recursive with local zones, /etc/hosts, scriptable zones.
  • Have a look at dnssexy.net.
  • What about IPAM and zone file editing? Well, you can use DDNS and DNSSEC to get much the same, and avoid zone reloads, and editing zones directly in databases. For example with a patch PowerDNS can do both DDNS and DNSSEC and that's very sweet.
  • Careful with qmail, because it does ANY queries; these often return DNSSEC records, and if the replies are large enough qmail crashes.
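To illustrate the unbound recommendation above, a minimal validating caching resolver configuration might look like this sketch (addresses and paths are hypothetical):

```
# unbound.conf fragment -- minimal validating caching resolver
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # enable DNSSEC validation with the root trust anchor
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
```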
Storage caching:
  • Why your SAN is slow: centralized disk farm, good for cost, bad for performance.
  • Good for backups, data in one place, but lots of competing workloads result in thrashing IO.
  • Make it faster with more hardware: more spindles, more RAM.
  • Make it faster with software: copy on write, quality of service. ZFS for example does not reuse space until disk is full, and performs very badly if free space is small or fragmented.
  • Hard disk: good at linear access, bad at random access, fragile. RAM fast for random, very expensive, very high power and heat. Flash in between, firmware is buggy though, writing costs quite a bit of power. Direct PCIe attached flash much faster, especially latency and IOPS. Rather expensive.
  • One good compromise is block level caching of SAN onto local flash drives.
  • Reads via cache are nearly as fast as uncached reads, writes can be write-round, write-through or write-back.
  • Flashcache is a Facebook kernel module, uses DM. Error resistant, retries. Unfortunately it allows you to screw things up because it does write back.
  • bcache is the other major tool, with better performance as it is designed specifically for flash. Regrettably it requires an intrusive kernel patch, and reformatting the partition for caching.
  • ZFS is well known and it has flash specific modes of operation, L2ARC and slog. In both modes caches are not persistent across reboots, but in between cache devices can be added and removed.
  • There are also closed source options.
130320 Wed: FLOSS UK Spring 2013 day one

My notes from the first day of presentations at the FLOSS UK Spring 2013 event which began with a commemoration of pod (Chris Cooper):

Analyzing logs of SSH machines:
  • Two SSH hosts.
  • Really important to have a central log server.
  • One could buy a commercial log analyzer, we have built our own.
  • Login attempts classified as real/system/error/malicious.
  • Most login attempts are to root.
  • Advice: no remote logins, only allow personal accounts.
  • Top 10 non-personal accounts get lots of attempts.
  • 3 accounts compromised in 2-3 years, mostly from already compromised machines.
Loneliness of the long distance sysadmin:
  • Problem domain is IC layout repositories.
  • Lots of file transfers to other sites, as the asset library needs replicating.
  • Typical builds are 200GB and 1M inodes.
  • Audience: it is an asset library problem, all files are read-only and the big deal is fast loading.
  • There is a case for non-distributed setup, but hiring policies are an issue, as single site policy restricts range of possible candidates.
  • An alternative is centralized processing, with distributed GUI, but latency is a big issue.
Monitoring from :Bytemark:
  • Using Nagios at first.
  • Distinct alerting system Mauve. Important that it is fine grained and alerts must be acknowledgeable.
  • Acknowledge by SMS, email or Jabber.
  • Custodian raises or clear issues.
  • Does not do dependencies.
  • Main advantage scalability.
  • Internal scripts for resource monitoring.
Ansible overview:
  • Once upon a time shell scripts and SSH loops.
  • Puppet and Chef are both nice.
  • Ansible makes configuration management easy, and makes you productive in a few minutes.
  • Very few dependencies. In practice only Python.
  • Inventory defines groups, hosts, and attributes attached to hosts, and groups can overlap and be nested, and the whole can be generated by a program.
  • Language to select sets of hosts.
  • Ansible sets a number of "facts" from the target host and local stuff. Ideally gathered by ohai or facter.
  • Playbooks are in YAML, collections of actions performed by modules.
  • Variables can come from inventory, plays, files, lookups.
  • Delegation means that to configure some aspect of host A do something on host B.
  • Ansible is push based, and can become pull, but it always pulls the whole repository.
  • Question: pull mode means many systems will be down and yet will run. Pointed out that it is trivial to trigger a push from a client.
  • Ansible can use fireball mode which uses 0MQ instead of SSH.
  • Ansible is a library too, so it is scriptable.
  • Extensible in any language, as long as already installed on target node.
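To make the notes above a bit more concrete, a minimal playbook sketch (the host group, package, and file names are hypothetical) might look like:

```yaml
# site.yml -- minimal hypothetical playbook
- hosts: webservers            # a group defined in the inventory
  user: root
  tasks:
    - name: ensure NTP is installed
      apt: pkg=ntp state=present
    - name: deploy ntp.conf from a template using inventory variables
      template: src=ntp.conf.j2 dest=/etc/ntp.conf
      notify: restart ntp
  handlers:
    - name: restart ntp
      service: name=ntp state=restarted
```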
Icinga:
  • Icinga is a fork of Nagios, because of IP issues and slow development.
  • Now database backed, but that is optional, unless you want new reporting.
  • New Core just a lot of fixes and new interesting tweaks.
  • Classic interface also makes life easier.
  • New UI has PHP, JavaScript, database backed.
  • Better PostgreSQL support. JasperReports based.
  • New Core was written with 0MQ, but it did not work out.
  • Icinga 2 redeveloped the core, for scalability, but only for very large setups. Made of components.
  • livestatus nice GUI.
  • Distributed case all have core, with different components at each installation; auth via certificates.
  • Icinga2 configuration has a better syntax, but old configuration works.
OpenNMS:
  • As the name says, OpenNMS has lots of network monitoring, but really it is about monitoring any infrastructure.
  • Lots of switch/router and OS types natively supported.
  • What you want: list systems, availability, ... reporting.
  • Both company and foundation behind it, and 100% GPLv3 code.
  • Consistent object model for huge scale.
  • In 2009 copyright licensing was cleaned up.
  • Not based on any other apps, but on nice libraries.
  • Architecture dictated by huge scalability, other requirements sacrificed to that.
  • Huge scalability means hundreds; scalability is requirement that comes from business growth.
  • List: discovery or provisioning. Checks: polling. What happened: notifications. Reporting: ....
  • Model: services, interfaces, nodes.
  • Discovery: usually via ping but others available. Once discovered it is provisioned.
  • Explicit provisioning also allowed, including imports and exports from other lists of entities.
  • Simple, intermediate, and extensive checks, like Nagios webinject.
  • Events happen, they can be deduplicated to alarms with a count.
  • Notifications via a lot of channels. Bad idea to fork, because OpenNMS base process very good. Very configurable.
  • Also performance data collection and graphing.
  • Various data protocols: SNMP most recommended. WMI is terrible, and XMP sounds like a good successor to SNMP.
  • Flexible performance data based event specifications.
  • Net-SNMP good, customize using extend or pass. Releases less than 5.5 have various annoying bugs.
  • Reporting: nice graphs.
Lightning talk: user education.
  • Users with manually updated security software.
  • Reminders resulted in 25% of users actually updating the software.
  • Pestering users raised that to 35%.
  • Extreme pestering about a very big issue resulted in 45% updating.
  • Automatic updates resulted in near 100% working.
Lightning talk: Logstash and Kibana.
  • Using logstash to collect logs.
  • Split up logs.
  • Filter them.
  • Store them into a key-value pair database.
  • Kibana can be used to plot the summaries.
  • Interesting multifaceted summaries in Kibana.
  • Kibana can also do realtime incremental logging.
Lightning talk: pictorial programming language.
  • Written in Javascript, only for Chrome for now.
Lightning talk: CiviCRM.
  • CiviCRM, a package to maintain membership lists in nonprofit societies.
  • Roster of members.
  • Can create groups.
  • Automatic membership via web and HTTP.
  • Fundraising.
  • Event management too.
  • Automatic groups.
  • Mailing lists.
  • Can map into Drupal.
  • Memberships of 2,000 to 3,000 and related mailings succeed.
Lightning talk: :Bytemark Symbiosis.
  • Wasting time is very boring.
  • It is a package to manage simple sites.
  • Mostly Wordpress blogs.
Lightning talk: ultimate Debian database.
  • Tracking entities like bugs, packages, members.
  • The aim is to put all Debian related data records in one database.
  • A repository and not a system-of-record, purely for MIS purposes.
  • Same for local sysadm.
  • For Debian it is PostgreSQL.
130319 Tue: FLOSS UK Spring 2013 tutorials

The FLOSS UK Spring 2013 event was preceded by a day of tutorials.

My notes from the Ansible tutorial:

My notes from the Juju tutorial:

130228 Thu: Using 'strace' or 'inotify' to find dependencies.

Quite amused today to find that some of the build systems mentioned in an interesting survey, and notably tup (1, 2), take an interesting approach to figuring out the dependencies between parts of the configuration to be built: instead of using a static tool like makedepend, which examines the sources of the components to build, they trace dynamically which files are accessed the first time a full build is done, using something equivalent to strace -e open (using the system call ptrace) or ltrace -e open (using an LD_PRELOADed wrapper).

Which is something that I use by hand to reverse engineer software package builds and behaviour.
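As a sketch of the dynamic approach, here is how one might extract a dependency list from strace-style output; the trace lines below are invented for illustration:

```python
import re

# Matches strace lines such as: open("util.h", O_RDONLY) = 4
OPEN_RE = re.compile(r'open\("([^"]+)",\s*([A-Z_|]+)\)\s*=\s*(-?\d+)')

def opened_files(trace_lines):
    """Return the set of paths successfully opened read-only,
    i.e. the build's inputs (dependencies) rather than its outputs."""
    deps = set()
    for line in trace_lines:
        m = OPEN_RE.search(line)
        if m and not m.group(3).startswith('-') and 'O_RDONLY' in m.group(2):
            deps.add(m.group(1))
    return deps

sample = [
    'open("main.c", O_RDONLY) = 3',
    'open("util.h", O_RDONLY) = 4',
    'open("main.o", O_WRONLY|O_CREAT|O_TRUNC) = 5',  # an output, not a dependency
    'open("gone.h", O_RDONLY) = -1 ENOENT',          # failed open, ignored
]
print(sorted(opened_files(sample)))  # ['main.c', 'util.h']
```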

This is interesting for me in several ways, but mostly it is an admission of failure: the failure to design languages and programming frameworks that treat dependencies as explicit and important. The failure of the approach to design purposefully in other words, and the necessity to support programming styles based on slapping stuff together until it sorts of works. Which brings to mind web based applications :-).

130224 Sun: Enterprise flash SSD group test and maximum latencies

The performance of flash memory (EEPROM) based storage devices is rather anisotropic, and while the typical performance is quite good, in some applications the corner cases matter.

In a recent group test of some enterprise flash SSD the testers commendably paid attention to maximum latencies, which matter because erasing flash memory can take time, and anyhow most flash SSD need to do background reorganisation of the storage layout.

As the graphs make clear there is a huge gap between the average write latency of around 1 millisecond and the worst case latency of between 15 and 27 milliseconds.

It would have been interesting to look at the variability of interleaving reads and writes, but there is another interesting test related to write rate (rather than latency) variability, where the result is pretty good as the tester is impressed:

Zeroing in to the one-second averages, as we did with Intel's SSD DC S3700, the P400m performs admirably (although it cannot beat the consistency we saw from the Intel drive). The SSD DC S3700 gave us 90% of its one-second averages within 99% of the overall average. In contrast, only 65% of the P400m's one-second averages fall within 99% of the overall average.

Micron's P400m does much better if you compare the individual data points to the product's specification instead of overall average. In fact, 99.8% of all one-second averages are higher than this drive's write specification.

A few months ago, these results would have been phenomenal. The problem is that Intel's 200 GB contender also achieves its results at a higher throughput.

The group test has also the usual interesting graphs that show how transfer size matters to average transfer rates.

130209 Sat: KDE Akonadi, Nepomuk, Strigi, and overambitious tracking

I have decided to have a look at the KDE SC 4 information management components, and I have written a brief summary as a new section in my KDE notes.

Having activated Akonadi and Nepomuk I was astonished to see that Nepomuk was scanning a large NFS filetree that I had mounted, in order to set within it full coverage of inotify watchpoints, when I had clearly indicated in the relevant settings that I wanted it to index only my home directory.

So I asked on the relevant IRC channel, and I was told that by default Nepomuk sets inotify watchpoints on every directory in the system because it has to monitor every directory in case a file for which it keeps tags or ratings is moved to it, so it can update its location in the database, as documented in two KDE issue entries (1, 2).

This is because Nepomuk will build a database of file names only for specifically indicated directories, but will keep tags or ratings for any file anywhere: even if it restricted setting tags or ratings only to files in the indicated directories, such files could be moved or linked to any other directory.

That to me seems quite an extreme view, because Nepomuk could set inotify watchpoints on the indicated directories only, and on the disappearance of a file from them could just remove from its database the associated tags or ratings.

Alternatively it could store tags or ratings in extended attributes which most UNIX/Linux filesystems support, and even some non-UNIX/Linux ones do. But I suspect that since KDE is supposed to be cross-platform, and indeed it is, the Nepomuk implementors preferred a method that relies on an external database to implement what are in effect custom extended attributes.

But the idea that Nepomuk is designed to keep track of associated external attributes for files anywhere on a system (with the possible exclusion of removable filetrees) is extreme, as some systems can have hundreds of thousands or even tens of millions of directories, each of which would require an inotify watchpoint.

130208 Fri: Impressions of the Dell U2412M monitor

Since I needed an extra monitor I have been rather uncertain whether to get a 27in LCD or a 24in LCD.

The point for a 27in LCD was to get one with a larger pixel size display like 2560×1440, with a 16:9 aspect ratio, such as a Digimate DGM IPS-2701WPH or a Dell U2713HM.

The point for a 24in LCD was to get a less expensive 16:10 aspect ratio display in a 1920×1200 pixel size. Eventually I decided to get one of the latter, in part because I don't like 16:9 aspect ratios, and the 1200 pixel vertical size is (just) sufficient for me. But in part because 27in and larger displays are a bit too large for my home computer desk, and a bit too heavy for my monitor arm.

But mostly because I mostly use the monitor connected to a laptop and most laptops don't support the 2560×1440 pixel size, certainly not over VGA, and likely not over HDMI either.

I would have gone for another Philips but they no longer offer a 24in 1920×1200 model with a wide viewing angle, so I bought a Dell U2412M which at around £220 inclusive of tax seems pretty good to me, the good things being:

The less good things are very few:

Compared to my other 24in 1920×1200 IPS display, the Philips 240PW9, it is broadly equivalent:

130129 Tue: USB3 and Thunderbolt transfer rates can be pretty good

In the usually excellent review and technology site X bit laboratories there is a recent review with intelligent, useful speed tests of an external disk with both USB3 and Thunderbolt interfaces.

Because the default rotating disk drive has a maximum transfer rate of around 115MB/s the reviewers replaced it with flash SSDs with both SATA2 and SATA3 interfaces capable of up to 400-500MB/s, and amazingly both USB3 and Thunderbolt support a few hundred MB/s transfer rates, with top read rates of 300-370MB/s and top write rates of 240-270MB/s.

This is quite remarkable and not far from the transfer rates possible with eSATA, which remains my favourite high speed interface as it is very likely to be less buggy and it is cheap. The same article makes clear that USB2 is still pretty slow, with top read and write speeds of 35MB/s and 28MB/s, but that is usually adequate for most peripherals, except for storage and video ones.

But USB3 is certainly a good alternative, and increasingly popular, and lower cost than Thunderbolt. It is still USB, and therefore likely to be much buggier especially in corner cases than eSATA, but it will probably be useful to have.

130119 Sat: The XNS internet framework compared to TCP/IP

In various discussions about networking I mentioned the XNS internet framework, which was one of the first large scale internetworking standards and was highly popular for decades as it was adopted by several large vendors (Novell, Banyan, ...).

It was in many ways better designed than the ARPANET/NCP framework, and as a result inspired many aspects of the TCP/IP network framework that succeeded the ARPANET one, and arguably it was in some ways better than TCP/IP.

The most interesting aspect of XNS is the addressing structure, where XNS addresses are 80 bit long, where the bottom 48 bits uniquely identify each host, and the top 32 bits are a routing hint that identifies the network to which the host is connected.

Such an addressing structure is interesting for several reasons, the principal of which is that the availability of 32 bit long network prefixes and 48 bit long host addresses would have most probably avoided running out of addresses as in the current IPv4 situation.

Because 2³² networks is a pretty large number, especially as each network can have an essentially arbitrary number of hosts.

Note that the number of hosts in a network is limited only by the size of the routing tables used in that network, as individual host routes can always be used, and thus a network could have several thousand hosts.

The other interesting aspect of the addressing structure is that the network number is not actually part of the address of a host, it is just a routing hint, and that therefore 48 bit host addresses must be globally unique. Indeed in XNS the prescription was that they should be Ethernet addresses, as there is already a mechanism to ensure their global unicity.
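The shape of such addresses can be sketched in a few lines (the network number and Ethernet address below are made up):

```python
# An XNS-style 80 bit address: a 32 bit network hint plus a 48 bit host ID,
# where the host ID is the globally unique Ethernet (MAC) address.

def pack_xns(network, host):
    """Combine a 32 bit network number and a 48 bit host ID into 80 bits."""
    assert 0 <= network < 2**32 and 0 <= host < 2**48
    return (network << 48) | host

def unpack_xns(addr):
    """Split an 80 bit address back into (network hint, host ID)."""
    return addr >> 48, addr & (2**48 - 1)

# e.g. network 0x42, host = Ethernet address 08:00:2b:01:02:03
addr = pack_xns(0x42, 0x08002B010203)
assert unpack_xns(addr) == (0x42, 0x08002B010203)
```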

Having host addresses be identically the same as Ethernet addresses has an interesting advantage that it avoids the need for neighbor discovery protocols like ARP at least on Ethernet, simplifying protocol implementation.

This is possible because appositely XNS requires that all the Ethernet interfaces a host has share the same Ethernet address, that is the XNS host address, a somewhat interesting situation.

One other interesting aspect of XNS was that it was based on 576 byte packets, even if the typical underlying medium was Ethernet and supported up to 1500 byte frames; the small maximum guaranteed packet size also meant that path MTU discovery was not necessary.

Apart from the addressing structure one of the more interesting differences between XNS and TCP/IP is the rather different starting points:

While in some ways I think that the TCP/IP way is more universal, the XNS architecture matches better how networking has actually been used for decades, as mesh networks are exceptionally rare nowadays, and the Internet is an internet of LANs.

130117 Thu: Comparing the resilience of RAID6 and RAID10

During a mailing list discussion about 4 drive RAID sets someone argued that RAID6 can lose any random 2 drives, while RAID10 can't, which seems superficially a good argument.

It is instead a poor argument based on geometry, one of those I call a syntactic argument, where the pragmatics of the situation are not taken into account.

The major point in a comparison of RAID6 with RAID10 is that they have very different performance and resilience envelopes which cannot be compared in a simplistic way.

As to speed (a component of performance) a RAID10 set, intact or with 1 drive missing, is usually much better than an equivalent 4 drive RAID6 set intact or with 1 drive missing, and similarly during rebuild, because, for example:

Resilience is also not comparable in geometric terms, because a RAID10 can continue operating despite the loss of more than 2 members of the set, as long as they are in different pairs.

Also, in purely geometric terms it is easy to have a RAID10 made of 3-way RAID1s, and that would also have the property of continuing to operate regardless of the loss of any 2 drives.
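The geometric point can be verified by brute-force enumeration, as in this small sketch:

```python
from itertools import combinations

def raid10_survives(failed, mirrors):
    """A RAID10 set survives iff no mirror set loses all of its members."""
    return all(not m <= failed for m in mirrors)

# A 4 drive RAID10 of two 2-way pairs survives 4 of the 6 possible
# double losses, failing only when both members of one pair are lost.
pairs = [{0, 1}, {2, 3}]
doubles = list(combinations(range(4), 2))
print(sum(raid10_survives(set(c), pairs) for c in doubles), "of", len(doubles))

# A 6 drive RAID10 of two 3-way mirrors survives any 2 drive loss.
triples = [{0, 1, 2}, {3, 4, 5}]
assert all(raid10_survives(set(c), triples) for c in combinations(range(6), 2))
```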

But it may be objected at this point that this would cost a lot more than a RAID6, and if this objection is relevant it must be because of more than purely geometric considerations, or purely speed based ones. Then one must consider opportunity costs, including purchase cost and other environmental factors.

Purely geometric considerations as to resilience in particular are rather misleading, for example because they must be based on the incorrect assumptions that RAID set failure rates are uncorrelated, and independent of the RAID set size and environmental conditions.

But the probability of a further failure in a RAID set depends on its size, and that probability is correlated and depends on common modes of failure, such as all drives being similar and in a similar environment, which are worst during all three modes of operation, intact, incomplete and rebuilding:

The extra stresses, especially during incomplete and rebuilding operation, can significantly raise the probability of a further drive failure with respect to a similar RAID10 set, and these are inevitably due to the P and Q blocks correlating work across the whole stripe.

The single great advantage of a RAID6 set is that it performs much like a slightly smaller RAID0 set during pure read workloads, and that it offers some degree of redundancy, at the heavy cost of making writes expensive, and incomplete or rebuilding operation very expensive.

Therefore in the narrow number of cases and setups where that performance and resilience envelope fits the workload it is useful. But it is not comparable to RAID10, whether with 2 or 3-way mirroring, and the RAID10 is generally preferable.

130116 Wed: MTA checking of MAIL FROM addresses strangeness

I was asked for an opinion on a case of an internal leaf mail server having mail refused by an internal destination server and this is the relevant transcript:

$ telnet smtp.example.com 25
Trying 192.0.2.10...
Connected to smtp.example.com.
Escape character is '^]'.
220 mail.example.com ESMTP Postfix
HELO server1.example.net.com
250 mail.example.com
MAIL FROM: <test@server1.example.net.com>
250 Ok
RCPT TO: <a.n.user@example.com>
450 <test@server1.example.net.com>: Sender address rejected: Domain not found
quit
221 Bye
Connection closed by foreign host.

Here the obvious minor issue is that the domain suffix of the name of the sending internal node was mistyped as .example.net.com instead of the intended .net.example.com.

The slightly less obvious minor issue is that on server1 the MTA configuration defaults to appending the full name of the node, instead of the name of the email domain example.com, which can be achieved in Postfix with append_at_myorigin or append_dot_mydomain.
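In Postfix terms the fix on server1 might be a main.cf fragment along these lines (a sketch, using the domain from the transcript):

```
# main.cf fragment on server1 (sketch): rewrite locally generated
# sender addresses with the mail domain rather than the node's own name
myorigin = example.com       # default is $myhostname, the node's FQDN
append_at_myorigin = yes     # user      -> user@$myorigin
append_dot_mydomain = yes    # user@host -> user@host.$mydomain
```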

But the really bad detail is that the destination MTA, for which example.com is local, checks the validity of the domain of the MAIL FROM: address at all, because the Postfix parameter reject_unknown_sender_domain is set; that such a parameter is enabled, or exists at all, is rather improper for several reasons:

130114 Mon: DM/LVM2 usefulness and lack thereof

While discussing my impression that LVM2 is mostly useless except for snapshots (also 1, 2) a smart colleague pointed out that it has one other useful feature: the ability to move a LV from one storage device to another transparently, that is while it is being actively accessed.

That was an interesting point, because the overall reason why I don't value DM/LVM2 is that I prefer to manage storage always at the filetree level, that is by dealing with collections of directories and files, because that's what is portable and well understood and clear.

In general I find very little reason to have either more than one partition per storage device, or even worse to have a partition spanning multiple storage devices, unless some level of RAID is involved, and RAID usually has significant advantages over mere volume management.

However admittedly the ability to transparently move a block device is sometimes interesting, just like the ability to snapshot a block device. These are I think the two occasionally worthwhile reasons for virtualizing a block device via DM/LVM2.

However the ability to transparently move filetrees is also available from the underrated AFS, which can be deployed in pretty large (800TB, 200M files) configurations with the ability to manage over a hundred thousand independent filetrees, and it is also part of the abilities of ZFS and BTRFS, as they in effect contain something like DM/LVM2, but in a slightly less futile design.

Other than AFS I personally think that moving the contents of block devices is best done with a bit of planning ahead using standardized partition sizes and by using RAID1 or something similar like DRBD because it is rather simpler and more robust.