TReX and XSummary
November 5, 2006. Posted by shahan in information retrieval, software development, XML.
As part of my Research Assistantship, supervised by Dr. Consens, I have worked with several existing code bases. One is TReX, which was used in the Initiative for the Evaluation of XML Retrieval (INEX). INEX is a global effort of over 50 groups who participate by working on a common XML document collection (Wikipedia this year, IEEE articles last year). Sharing results promotes an open research environment and helps direct future research initiatives.
I am in the process of refactoring TReX and separating its implementation into discrete, modular components. The refactoring is challenging but very rewarding, as it requires understanding not only how tightly the data structures are integrated with each other, but also what the code is actually doing and why. One tool I found very helpful for understanding the system as a whole is an Eclipse plug-in called Creole. It visualizes the Java packages, classes, and method calls, and even shows the source code, all within a common interface of boxes and arrows. The feature I found most useful on TReX was the ability to view how the code was built up through its CVS check-ins.
A cross-listed course (offered at both the undergraduate and graduate level) that I took in the summer of 2006, Software Architecture and Design, taught by Greg Wilson, was also extremely useful: it showed how software patterns can help prevent the problems the code currently faces. Greg taught ways of thinking about the structure of software that allow for effective expansion. The paper Growing a Language by Guy L. Steele Jr. is an excellent read that describes the concept of growth; although it is not directed at software design, its ideas are equally applicable. It is also a very easy read, though it may seem very strange at first glance.
A second code base I have worked with is XSummary, which summarizes the structure of XML document collections. It was demonstrated at the Technology Showcase at IBM's CASCON 2006 in mid-October. It is an Eclipse plug-in developed using the Zest visualization toolkit. It presented structural summaries of various XML document collections, such as Wikipedia, BPEL, and blog feeds in RSS and Atom. It depicts the parent-child relationships of XML structural elements and lets the user add detail by displaying different summaries for a particular element. XSummary is developed by Flavio Rizzolo. I also integrated a coverage and reachability model developed by Sadek Ali. Coverage and reachability provide a way of identifying which elements are considered important within the collection. The user can specify a range with a slider, and only the elements within that range are displayed, which simplifies the level of detail.
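To make the idea of a structural summary more concrete, here is a minimal Python sketch. It is not XSummary's code and it does not implement the coverage and reachability model. It only counts parent-child tag pairs across a small collection, then keeps the elements whose share of all element occurrences falls within a chosen range, as a simple stand-in for the slider. The function names (`summarize`, `filter_by_share`), the sample documents, and the "share of occurrences" measure are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize(xml_documents):
    """Count parent->child tag pairs and tag occurrences over a collection."""
    edges = Counter()  # (parent_tag, child_tag) -> number of occurrences
    tags = Counter()   # tag -> number of occurrences
    for text in xml_documents:
        root = ET.fromstring(text)
        # iter() visits the root and every descendant element.
        for parent in root.iter():
            tags[parent.tag] += 1
            for child in parent:
                edges[(parent.tag, child.tag)] += 1
    return edges, tags


def filter_by_share(edges, tags, low, high):
    """Keep edges whose two endpoints both have an occurrence share in [low, high].

    The "share" is a hypothetical importance measure used only in this sketch.
    """
    total = sum(tags.values())
    keep = {tag for tag, count in tags.items() if low <= count / total <= high}
    return {
        edge: count
        for edge, count in edges.items()
        if edge[0] in keep and edge[1] in keep
    }


# Two tiny made-up documents standing in for a collection.
docs = [
    "<article><title/><sec><p/><p/></sec></article>",
    "<article><title/><sec><p/></sec><sec><p/></sec></article>",
]

edges, tags = summarize(docs)
for (parent, child), count in sorted(edges.items()):
    print(f"{parent} -> {child}: {count}")

# Narrowing the range hides the less frequent elements.
print(filter_by_share(edges, tags, 0.25, 1.0))
```

Narrowing the range passed to `filter_by_share` plays the role of moving the slider: fewer elements qualify, so the displayed summary gets simpler.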