VisTopK – Initial Setup
November 6, 2006. Posted by shahan in eclipse, eclipse plugin, GEF.
Today I officially started coding for the project and was very productive! I needed to set up a set of inverted indexes over a collection of files. To build these indexes I used Apache Lucene, and it was remarkably easy. I had some difficulties initially because I was trying to use some contributed modules (the in-memory indexer, to be exact, which I figured might come in handy at some point). I also made the mistake of adding an embedded relational database to allow for a more “natural” way to access the indexes. Based on the information found here, I gave HSQLDB a shot, and it was extremely easy to set up and use. In the end, though, I removed the relational database, used Lucene’s built-in query engine instead, and accessed all the inverted indexes for the terms within the document collection.
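For readers unfamiliar with the term, here is a rough sketch of what an inverted index holds: a map from each term to the documents that contain it, with a term frequency per document. This is plain Java and deliberately not Lucene's API. The class name `InvertedIndexSketch`, its methods, and the simple tokenizer are all made up for illustration, and Lucene's real index is far more elaborate.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical, not Lucene): term -> (docId -> term frequency). */
final class InvertedIndexSketch {
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    /** Lowercases the text, splits on non-alphanumeric characters, and counts each term. */
    void addDocument(String docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new LinkedHashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns the posting list (docId -> frequency) for a term, or an empty map if unseen. */
    Map<String, Integer> postingsFor(String term) {
        Map<String, Integer> list = postings.get(term.toLowerCase(Locale.ROOT));
        return list == null ? Collections.<String, Integer>emptyMap() : list;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("d1", "Top-k query processing");
        index.addDocument("d2", "Query engines answer a query");
        System.out.println(index.postingsFor("query"));
    }
}
```

Looking up a term returns its posting list directly, which is the per-term access that the rest of this project relies on.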
Now it’s a matter of deciding whether to reuse TReX’s existing No-Random-Access (NRA) threshold algorithm code or roll my own. Reasons to stay away from the existing code are its tight integration with TReX’s data structures, its lack of parameters, and a bad code smell. If I roll my own, I then have to decide whether to integrate it into Lucene or keep it as a simple external algorithm engine. An ambitious Lucene contrib vision, possibly? I’ve never contributed to open source, but I am truly inspired by the dedication required, as described in Karl Fogel’s (FREE) ebook Producing Open Source Software. Starting from scratch will also allow for a nicer class hierarchy that can take advantage of some interesting concepts mentioned in the IO-Top-k paper, such as probabilistic inference or skew detection, to terminate the algorithm even sooner.
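To make the "roll my own" option concrete, here is a minimal sketch of a No-Random-Access style threshold algorithm in plain Java. It is neither TReX's code nor anything from Lucene. The names `NraSketch`, `Entry`, and `topK` are hypothetical, and it assumes that scores are non-negative, that each list is sorted by descending score, that a document's total is the sum of its per-list scores, and that a document missing from a list scores 0 there. The idea is to read all lists in parallel using sorted access only, keep a worst-case and best-case total for every document seen so far, and stop as soon as nothing outside the current top k can still overtake it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical NRA-style top-k sketch: sorted access only, sum aggregation. */
final class NraSketch {
    /** One posting: a document id and its score in one list. */
    static final class Entry {
        final String doc;
        final double score;
        Entry(String doc, double score) { this.doc = doc; this.score = score; }
    }

    /** Lower bound on a total: unseen lists (NaN) contribute 0. */
    static double worst(double[] partial) {
        double sum = 0.0;
        for (double s : partial) if (!Double.isNaN(s)) sum += s;
        return sum;
    }

    /** Upper bound on a total: unseen lists contribute the last score read there. */
    static double best(double[] partial, double[] lastSeen) {
        double sum = 0.0;
        for (int i = 0; i < partial.length; i++) {
            sum += Double.isNaN(partial[i]) ? lastSeen[i] : partial[i];
        }
        return sum;
    }

    /** Returns ids of a top-k set; each list must be sorted by descending score. */
    static List<String> topK(List<List<Entry>> lists, int k) {
        if (k <= 0) return new ArrayList<>();
        final int m = lists.size();
        final Map<String, double[]> seen = new HashMap<>(); // doc -> per-list scores
        double[] lastSeen = new double[m];
        for (int depth = 0; ; depth++) {
            boolean exhausted = true;
            for (int i = 0; i < m; i++) {
                List<Entry> list = lists.get(i);
                if (depth < list.size()) {
                    Entry e = list.get(depth);
                    double[] partial = seen.computeIfAbsent(e.doc, d -> {
                        double[] row = new double[m];
                        Arrays.fill(row, Double.NaN); // NaN marks "not seen in this list"
                        return row;
                    });
                    partial[i] = e.score;
                    lastSeen[i] = e.score;
                    exhausted = false;
                } else {
                    lastSeen[i] = 0.0; // nothing left to read in this list
                }
            }
            // Rank seen documents by worst-case total, ties broken by id.
            List<String> ranked = new ArrayList<>(seen.keySet());
            ranked.sort(Comparator.comparingDouble((String d) -> worst(seen.get(d)))
                    .reversed()
                    .thenComparing(Comparator.<String>naturalOrder()));
            if (exhausted) {
                return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
            }
            if (ranked.size() >= k && canStop(ranked, seen, lastSeen, k)) {
                return new ArrayList<>(ranked.subList(0, k));
            }
        }
    }

    /** True when no unseen or lower-ranked document can beat the k-th worst score. */
    private static boolean canStop(List<String> ranked, Map<String, double[]> seen,
                                   double[] lastSeen, int k) {
        double kthWorst = worst(seen.get(ranked.get(k - 1)));
        double threshold = 0.0; // best possible total of a completely unseen document
        for (double s : lastSeen) threshold += s;
        if (threshold > kthWorst) return false;
        for (int j = k; j < ranked.size(); j++) {
            if (best(seen.get(ranked.get(j)), lastSeen) > kthWorst) return false;
        }
        return true;
    }
}
```

Note that this kind of algorithm only guarantees the membership of the top-k set. The order and scores it reports are lower bounds, since it may stop before every score has been read. The early-termination ideas from the IO-Top-k paper are not attempted here.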
Once that’s done I’ll have a prototype for XML document collection indexing and retrieval using a threshold algorithm for top-k query processing.
Some related tools that seem interesting are Luke, a very nice-looking and feature-filled tool for viewing and modifying Lucene indexes, and Lius, which I haven’t tried but which seems to do the same thing as Luke.