VisTopK – Initial Setup
November 6, 2006. Posted by shahan in eclipse, eclipse plugin, GEF.
Today I officially started coding for the project and was very productive! I needed to set up a set of inverted indexes over a collection of files. For this I used Apache Lucene, and was it ever easy. I had some difficulties initially because I was trying to use some contributed modules (the in-memory indexer, to be exact, which I figured might come in handy at some point). I also, incorrectly, decided to use an embedded relational database to allow for a more “natural” way to access the indexes. Based on the information found here, I gave HSQLDB a shot, and while it was extremely easy to set up and use, I ultimately removed the relational database, used Lucene’s built-in query engine instead, and accessed the inverted indexes for the terms in the document collection directly.
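To make the inverted-index idea concrete, here is a minimal in-memory sketch in plain Java. This is not Lucene's actual API (Lucene also handles tokenization, scoring, and on-disk storage far more robustly); the class and method names here are made up for illustration.

```java
import java.util.*;

// A minimal in-memory inverted index: term -> (docId -> term frequency).
// Illustrative sketch only; a real engine like Lucene does much more.
public class InvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Tokenize a document naively on non-word characters and record
    // each term's frequency for this docId.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Return the posting list for a term, empty if the term is absent.
    public Map<Integer, Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "XML retrieval with inverted indexes");
        idx.addDocument(2, "inverted indexes power keyword retrieval");
        System.out.println(idx.lookup("inverted")); // postings from both documents
    }
}
```

A top-k engine would then consume these posting lists, typically after sorting each one by a relevance score rather than raw frequency.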
Now it’s a matter of deciding whether to take TReX’s existing No-Random-Access threshold algorithm code or to roll my own. Reasons to stay away from the existing code are: tight integration with TReX’s data structures, lack of parameters for its use, and a bad code smell. If I roll my own, the next decision is whether to integrate it into Lucene or keep it as a simple external algorithm engine. An ambitious Lucene contrib vision, possibly? I’ve never contributed to open source, but I am truly inspired by the dedication required, as described in Karl Fogel’s (FREE) ebook Producing Open Source Software. Starting from scratch would also allow for a nicer class hierarchy that takes advantage of some interesting concepts mentioned in the IO-Top-k paper, such as probabilistic inference or skew detection to terminate the algorithm even sooner.
Once that’s done I’ll have a prototype for XML document collection indexing and retrieval using a threshold algorithm for top-k query processing.
Some related tools that seem interesting are: Luke, a Lucene index-modification/viewing tool, very nice looking and feature-filled; and Lius, which I haven’t tried but which seems to do the same thing as Luke.
TReX and XSummary
November 5, 2006. Posted by shahan in information retrieval, software development, XML.
Currently, as part of my Research Assistantship supervised by Dr. Consens, I have worked with several of the existing code bases. One is called TReX, which was used in the Initiative for the Evaluation of XML Retrieval (INEX). INEX is a global effort consisting of over 50 groups who participate by working on a common XML document collection (Wikipedia this year, IEEE articles last year). The sharing of results promotes an open research environment and also helps direct future research initiatives. I am in the process of refactoring and separating the implementation of TReX into discrete, modular components. The refactoring process is challenging but very rewarding, as it requires understanding not only how tightly integrated the data structures are with each other, but also what the code is actually doing and why.

One tool I found very helpful in understanding the system as a whole is an Eclipse plug-in called Creole. It provides the ability to visualize the Java packages, classes, and method calls, and even to view the source code, all from within a common interface of boxes and arrows. The most useful feature applied against TReX was the ability to view the evolution of the code through its CVS check-ins. Further, a cross-listed course (one offered at both the undergraduate and graduate level) I took this summer of ’06, Software Architecture and Design taught by Greg Wilson, was extremely useful, as it showed how software patterns can help prevent the problems currently facing the code. Greg taught ways of thinking about the structure of software in order to allow for effective expansion. The paper Growing a Language by Guy L. Steele Jr. is an excellent read describing the concept of growth; while not directed towards software design, its concepts and notions are equally applicable. It is also a very easy read, which may at first glance seem very strange.
A second code base I have worked with is called XSummary, which summarizes the structure of XML document collections. It was demonstrated at the Technology Showcase at IBM’s CASCON 2006 in mid-October. It is an Eclipse plug-in developed using the Zest visualization toolkit. It presented structural summaries of various XML document collections, including Wikipedia, BPEL, and blog feeds in RSS and Atom. It depicts the parent-child relationships of XML structural elements and allows detail to be added by displaying different summaries on a particular element. XSummary is developed by Flavio Rizzolo. I also integrated a coverage and reachability model developed by Sadek Ali. Coverage and reachability provide a way of identifying which elements are considered important within the collection, with the ability to specify a range, thus simplifying the level of detail by displaying only the elements within a slider-selected range.
VisTopK – Visualization of Top-K Information Retrieval Algorithms
November 3, 2006. Posted by shahan in eclipse, eclipse plugin, GEF.
This project is for a course credit and is supervised by Dr. Consens. I will provide a visualization of top-k results using static and dynamic views of Threshold Algorithm (TA) traces. This involves the use of inverted indexes from an XML document collection and will explore the effect of various factors such as compression, encoding, and retrieval/indexing algorithms. The deliverable will be an Eclipse plug-in, chosen for Eclipse’s extensibility and ease of integration with various components. My current thinking is a main view containing a stock-chart-style display plus a property view. The visualization will be a front end to the XML-format trace data collected during a separate TA run. Static views will provide snapshots, and dynamic views will show behaviour over time. The visualization will be based on the depiction in:
IO-Top-k: Index-access Optimized Top-k Query Processing. Holger Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum.
It is an interesting paper that provides a good foundation for TAs; the algorithms themselves are relatively easy to understand, but they require picking up the terminology of the field. It discusses the relationships between, and the advantages and differences of, several TA methods, among them No Random Access (NRA), Random Access (RA), and a combination of both.
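As a rough illustration of the NRA variant, here is a sketch in plain Java, assuming each inverted list is already sorted by score in descending order and per-list scores are combined by summation. The class and method names are my own invention; this omits the refinements (probabilistic pruning, access scheduling, skew detection) that the paper discusses.

```java
import java.util.*;

// Sketch of the No-Random-Access (NRA) threshold algorithm: scan m
// score-sorted lists in lock step, track worst-case (seen scores only)
// and best-case (seen scores plus the current scan-depth score for
// unseen lists) bounds per candidate, and stop once no candidate
// outside the current top-k can overtake it. Illustrative only.
public class NraSketch {

    public static class Posting {
        final int doc;
        final double score;
        public Posting(int doc, double score) { this.doc = doc; this.score = score; }
    }

    // lists: one posting list per query term, each sorted by score descending.
    public static List<Integer> topK(List<List<Posting>> lists, int k) {
        int m = lists.size();
        double[] lastSeen = new double[m];              // score at current depth per list
        Map<Integer, double[]> partial = new HashMap<>(); // doc -> per-list score, NaN = unseen

        for (int depth = 0; ; depth++) {
            boolean advanced = false;
            for (int i = 0; i < m; i++) {
                List<Posting> list = lists.get(i);
                if (depth < list.size()) {
                    Posting p = list.get(depth);
                    lastSeen[i] = p.score;
                    double[] a = partial.get(p.doc);
                    if (a == null) {
                        a = new double[m];
                        Arrays.fill(a, Double.NaN);
                        partial.put(p.doc, a);
                    }
                    a[i] = p.score;
                    advanced = true;
                } else {
                    lastSeen[i] = 0;                    // exhausted list contributes no more
                }
            }

            // Rank all seen candidates by their worst-case (known) score.
            List<Integer> docs = new ArrayList<>(partial.keySet());
            docs.sort((x, y) -> Double.compare(worst(partial.get(y)), worst(partial.get(x))));

            if (docs.size() >= k) {
                double threshold = worst(partial.get(docs.get(k - 1)));
                double othersBest = 0;                  // best case of seen non-top-k docs
                for (int j = k; j < docs.size(); j++)
                    othersBest = Math.max(othersBest, best(partial.get(docs.get(j)), lastSeen));
                double unseenBest = 0;                  // best case of never-seen docs
                for (double s : lastSeen) unseenBest += s;
                if (threshold >= Math.max(othersBest, unseenBest) || !advanced)
                    return new ArrayList<>(docs.subList(0, k));
            } else if (!advanced) {
                return docs;                            // fewer than k docs exist
            }
        }
    }

    private static double worst(double[] a) {
        double s = 0;
        for (double v : a) if (!Double.isNaN(v)) s += v;
        return s;
    }

    private static double best(double[] a, double[] lastSeen) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Double.isNaN(a[i]) ? lastSeen[i] : a[i];
        return s;
    }
}
```

The key point is that NRA never jumps into a list to look up a particular document's score; it only reads each list top-down, which is exactly the access pattern available when random access is expensive or impossible.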
Top-K Results and Threshold Algorithms
November 3, 2006. Posted by shahan in information retrieval.
Top-k retrieval is simply a method of requesting the top k elements from a collection, akin to asking for the top 10 sellers on eBay. A common way to compute top-k results is a Threshold Algorithm (TA), one which terminates when a certain threshold is reached. The threshold in IR is based on the inverted index maintained for a particular keyword (associating a relevance value with the keyword’s location, as mentioned in a previous blog entry). The simplest example: to retrieve the top k entries for a keyword from an inverted index R containing n elements (k < n), simply sort R in descending order of score and take the first k elements. TAs extend this task to several lists at once and provide various optimizations based on the available features of the inverted indexes. One important feature is whether the lists support only sequential scans, random access, or a combination of both. An example of a situation where random access is not available is when the inverted index is based on a list that may change and is not directly accessible, such as parsing Google search results. More information on the features and optimization of TAs will be provided later.
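The single-list case described above can be sketched in a few lines of Java; the seller names and scores here are invented purely for illustration.

```java
import java.util.*;

// Single-list top-k: sort the entries by score in descending order
// and take the first k. This is the trivial baseline that threshold
// algorithms generalize to multiple lists.
public class SimpleTopK {
    public static List<Map.Entry<String, Double>> topK(Map<String, Double> scores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // descending by score
        return entries.subList(0, Math.min(k, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> sellers = new HashMap<>();
        sellers.put("alice", 120.0);
        sellers.put("bob", 340.0);
        sellers.put("carol", 95.0);
        System.out.println(topK(sellers, 2)); // the two highest-scoring sellers
    }
}
```

The cost of this baseline is a full sort of the list; the point of TAs is to avoid reading, let alone sorting, entire lists when only the first k results are needed.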