Some years ago, well before the surge in popularity
of web search engines, I used to be interested in
textual databases, which is what web search engines
use as backends. At that time these were largely based
on the
Z39.50
protocol for textual database queries, and I used
FreeWAIS
for indexing my e-mail and news archives. It used to
be that only information retrieval people knew about
WAIS and similar tools, and they fell a bit into
obscurity, while various people started applying
information retrieval to web searching, and I stopped
using FreeWAIS, relying more on archiving by topic.
The popularity of web-based index searching has made
textual databases a current topic again, and this has
resulted in several new WAIS-like projects, though
some are done a bit amateurishly. Anyhow, I wanted to
start using one to index e-mail and saved web pages,
so I had a look at several of them and briefly tried
some.
The ones I tried are:
- Strigi
- This is the
KDE
default indexer, and I tried version 0.5.11 under
KDE 3.5, and it is based on Lucene. The good points
are that it is fairly UNIXy, with decent command
line support, and an interesting idea for a query
language. Well, it sort of worked, sometimes; but
it looped all the time, would get stuck, and the
communication between the GUI client and the indexer
and searcher was unreliable.
- Google Desktop for Linux
- Interesting package, a bit old, but seemed to
mostly work. It seems to have comprehensive format
support, and searches seem to be quite good. But it
has three big issues:
- It is a closed-source application with
virtually no documentation, and the GNU/Linux
version seems to be low priority. It could
easily become abandonware.
- It is extremely un-UNIXy as it has no command
line interface; once launched, all use and control
is via a web interface, and all its files are
binaries with an unknown format.
- In order to reduce system load while indexing,
it indexes very slowly even when there is no
load on the system.
and a showstopper: if the system crashes while
it is indexing, something gets corrupted and the
crawler subprocess dies repeatedly. One can recover
by saving the repo
directory and then
recreating the containing desktop
directory, but that is rather inconvenient (a
sketch of this recovery procedure follows the
list below).
- Beagle
- I have only very lightly tested Beagle, and it
seems fairly nice, with a rather negative point for
me that it is written for the Microsoft CLR. But it has
a showstopper too: Beagle keeps a cache of indexed
documents, and this cache contains a very large
number of files under 2KiB in size. This makes
things like backups very, very slow (the small
sketch after this list shows how to gauge the
problem).
- recoll
- This is a very nice package indeed. It is very
UNIXy, with good command line support, sensible
configuration files, and even
extensive documentation.
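
As promised above, here is a minimal sketch of the recovery
procedure for the Google Desktop crawler corruption, written
in Python. The exact paths are an assumption on my part
(something like a desktop directory under ~/.google containing
a repo subdirectory); adjust them to wherever your install
keeps its data:

  # Sketch of the recovery described above: save the "repo"
  # directory, recreate the containing "desktop" directory,
  # then put the saved "repo" back. The location below is an
  # ASSUMPTION; adjust it to your install before running.
  import shutil
  from pathlib import Path

  desktop = Path.home() / ".google" / "desktop"  # assumed location
  repo = desktop / "repo"                        # the index data to keep
  saved = Path.home() / "repo.saved"             # temporary safe place

  shutil.move(str(repo), str(saved))   # save the repo directory
  shutil.rmtree(desktop)               # discard the corrupted state
  desktop.mkdir(parents=True)          # recreate an empty desktop directory
  shutil.move(str(saved), str(repo))   # restore the saved repo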
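
And to gauge the Beagle small-file problem, here is a small
sketch that counts the files under 2KiB in a directory tree;
the default path is only a guess at where Beagle keeps its
cache, so pass the real location as an argument:

  # Count files under 2KiB in a tree, to show why backup
  # tools crawl on Beagle's cache. The default path is an
  # ASSUMPTION; pass the actual cache directory instead.
  import os, sys
  from pathlib import Path

  root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path.home() / ".beagle"
  small = total = 0
  for dirpath, _, filenames in os.walk(root):
      for name in filenames:
          total += 1
          try:
              if (Path(dirpath) / name).stat().st_size < 2048:
                  small += 1
          except OSError:
              pass  # skip files that vanish or cannot be stat'ed
  print(small, "of", total, "files are under 2KiB")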