Some years ago, well before the surge in popularity
of web search engines, I used to be interested in
textual databases, which is what web search engines
use as backends. At that time these were largely based
on the
Z39.50
protocol for textual database queries, and I used
FreeWAIS
for indexing my e-mail and news archives. It used to
be that only information retrieval people knew about
WAIS and similar tools, and they fell a bit into
obscurity, while various people started applying
information retrieval to web searching, and I stopped
using FreeWAIS, relying more on archiving by topic.
The popularity of web-based index searching has made
textual databases a current topic again, and this has
resulted in several new WAIS-like projects, though
some are done a bit amateurishly. Anyhow, I wanted to
start using one to index e-mail and saved web pages,
so I had a look at several of them and briefly tried
some.
The ones I tried are:
- Strigi
- This is the
KDE
default indexer, and I tried version 0.5.11 under
KDE 3.5, and it is based on Lucene. The good points
are that it is fairly UNIXy, with decent command
line support, and an interesting idea for a query
language. Well, it sort of worked, sometimes; but
it looped all the time, would get stuck, and the
communication between the GUI client and the indexer
and searcher was unreliable.
- Google Desktop for Linux
- Interesting package, a bit old, but seemed to
mostly work. It seems to have comprehensive format
support, and searches seem to be quite good. But it
has three big issues:
- It is a closed-source application with
virtually no documentation, and the GNU/Linux
version seems to be low priority. It could
easily become abandonware.
- It is extremely un-UNIXy as it has no command
line interface; once launched, all use and control
is via a web interface, and all its files are
binaries with an unknown format.
- In order to reduce system load while indexing,
it indexes very slowly even when there is no
load on the system.
and a showstopper: if the system crashes while
it is indexing, something gets corrupted and the
crawler subprocess dies repeatedly. One can recover
by saving the repo
directory and then
recreating the containing desktop
directory, but that is rather inconvenient (a
sketch of this recovery procedure follows the
list below).
- Beagle
- I have only very lightly tested Beagle, and it
seems fairly nice, with a rather negative point for
me that it is written for the Microsoft CLR. But it has
a showstopper too: Beagle keeps a cache of indexed
documents, and this cache contains a very large
number of files under 2KiB in size. This makes
things like backups very, very slow (the small
sketch after this list shows how to gauge the
problem).
- recoll
- This is a very nice package indeed. It is very
UNIXy, with good command line support, sensible
configuration files, and even
extensive documentation.
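
As promised above, here is a minimal sketch of the recovery
procedure for the Google Desktop crawler corruption, written
in Python. The exact paths are an assumption on my part
(something like a desktop directory under ~/.google containing
a repo subdirectory); adjust them to wherever your install
keeps its data:

  # Sketch of the recovery described above: save the "repo"
  # directory, recreate the containing "desktop" directory,
  # then put the saved "repo" back. The location below is an
  # ASSUMPTION; adjust it to your install before running.
  import shutil
  from pathlib import Path

  desktop = Path.home() / ".google" / "desktop"  # assumed location
  repo = desktop / "repo"                        # the index data to keep
  saved = Path.home() / "repo.saved"             # temporary safe place

  shutil.move(str(repo), str(saved))   # save the repo directory
  shutil.rmtree(desktop)               # discard the corrupted state
  desktop.mkdir(parents=True)          # recreate an empty desktop directory
  shutil.move(str(saved), str(repo))   # restore the saved repo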
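
And to gauge the Beagle small-file problem, here is a small
sketch that counts the files under 2KiB in a directory tree;
the default path is only a guess at where Beagle keeps its
cache, so pass the real location as an argument:

  # Count files under 2KiB in a tree, to show why backup
  # tools crawl on Beagle's cache. The default path is an
  # ASSUMPTION; pass the actual cache directory instead.
  import os, sys
  from pathlib import Path

  root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.home() / ".beagle"
  small = total = 0
  for dirpath, _, filenames in os.walk(root):
      for name in filenames:
          total += 1
          try:
              if (Path(dirpath) / name).stat().st_size < 2048:
                  small += 1
          except OSError:
              pass  # skip files that vanish or cannot be stat'ed
  print(small, "of", total, "files are under 2KiB")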