ICDE 2008 DescribeX Demonstration April 10, 2008Posted by shahan in Uncategorized.
Tags: automaton, automaton intersection, Brics, describex, eclipse, GEF, refinement, structural summary, XML, Zest
add a comment
This post is an outline of the DescribeX and the demonstration at ICDE 2008. The 4-page demonstration submission will be available soon.
UPDATE: The submission is available online here.
DescribeX is a graphical Eclipse plugin for interacting with structural summaries of XML collections. It is developed in Java using GEF, Zest (now incorporated into GEF), Brics (a Java automaton library), and Apache Lucene (a Java information retrieval library). The structural summaries are defined using an axis path regular expression (AxPRE).
Several versions have been developed, each new version allowing a different type of summary as well as different interactions with the summary.
The oldest version, originally developed for Cascon 2006, created a P* summary (or F&B-Index) and thus created the structural summary as a tree. A tree graph layout algorithm from GEF was used. Only a P*C refinement was available using XPath expressions evaluated against all the files in the collection. The control panel for this version is on the bottom on the left.
The second version allowed the creation of an A(k)-index, allowing the user to specify the height in the path for which to consider when creating the summary partitions. This used Zest (now incorporated into GEF) for the layout algorithm due since a structural summary based on the A(k)-index can create a graph instead of a tree.
The third version implements the true AxPRE expressions, using the Brics automaton Java library for converting the regular expression to a NFA. A label summary was created of the collection and refinements were processed by intersecting the NFA of the regular expression with the automaton of the label summary. Zest was also used for the layout algorithm. The control panel for this version is in the middle on the left side.
The differences between the versions are in the extra features such as the additional filters such as coverage and highlighting elements from a keyword query.
The key points of the demonstration are that our tool allows a user to quickly and easily determine the paths that exist in the collection, determine the importance of summary nodes, as well as interact with the structural summary by performing refinements. An additional aspect is the ability to highlight the elements that contain the terms in keyword search, this is in relation to our participation in INEX.
The attached screenshot shows three graphs, the topmost and middle graphs are P* structural summaries (or F&B-indexes) of two protein-protein interaction (PPI) datasets conforming to the PSI-MI schema standard. These two graphs are based on the first version and shows the important nodes coloured green using a coverage value of 50%, i.e. showing the nodes that together contain 50% of the entire collection’s total number of extents. Other coverage measures are easily available (such as a random walk coverage) and easily implementable. The first (topmost) dataset, HPRD, is a single 60MB XML file while the second (middle) dataset, Intact, is a collection of 6 XML files totalling 20MB. It should be noted that these are only a small subset of the gigabyte size collections available. We can see that the structure of the larger HPRD collection has a smaller structure in use than the Intact collection.
I obtained some very good feedback after demonstrating DescribeX to several of the attendees. Some of the feedback included displaying cardinalities as well as displaying the information retrieval component while using summaries. It would have been nice to show how the scoring of a document would have been affected if some of the summary nodes were refined using an AxPRE to combine elements containing the search term. Next time I hope to allow the user to use the plugin to prod the product, “It’s like walking the high wire without a safety net” as Guy Lohman put it.
Future work involves preparing a downloadable plugin for interested users. As it stands, the three versions can be made available and can work alongside each other (and actually the third version requires the first version); however, the instructions for use have not been updated in a while (though the application is easy to use). There is also a lack of extensibility of the newer version since I would like to update the way in which the extension point for filters and coverage are implemented.
Published: Exploring PSI-MI XML Collections Using DescribeX October 2, 2007Posted by shahan in publication, software development, standards, visualization, XML.
Tags: publication, standards, XML
1 comment so far
My first official publication :) Thanks to Reza for putting so much hard work into it as well as his patience for some of the DescribeX bug fixes. Many thanks also go to my professors Mariano and Thodoros who guide and encourage at every opportunity.
PSI-MI has been endorsed by the protein informatics community as a standard XML data exchange format for protein-protein interaction datasets. While many public databases support the standard, there is a degree of heterogeneity in the way the proposed XML schema is interpreted and instantiated by different data providers. Analysis of schema instantiation in large collections of XML data is a challenging task that is unsupported by existing tools. In this study we use DescribeX, a novel visualization technique of (semi-)structured XML formats, to quantitatively and qualitatively analyze PSI-MI XML collections at the instance level with the goal of gaining insights about schema usage and to study specific questions such as: adequacy of controlled vocabularies, detection of common instance patterns, and evolution of different data collections. Our analysis shows DescribeX enhances understanding the instance-level structure of PSI-MI data sources and is a useful tool for standards designers, software developers, and PSI-MI data providers.
Reza Samavi, Mariano Consens, Shahan Khatchadourian, Thodoros Topaloglou. Exploring PSI-MI XML Collections Using DescribeX. Journal of Integrative Bioinformatics, 4(3):70, 2007. Online Journal: link