Tags: cascon, conference, describex, ibm, vistopk
I’m happy to say that two projects I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTop-k, both supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22 – 25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plugin skills have improved tremendously since then. As a student I’m also looking forward to the food 😉
Published: Exploring PSI-MI XML Collections Using DescribeX
October 2, 2007
Posted by shahan in publication, software development, standards, visualization, XML.
Tags: publication, standards, XML
My first official publication 🙂 Thanks to Reza for putting so much hard work into it, as well as for his patience with some of the DescribeX bug fixes. Many thanks also go to my professors Mariano and Thodoros, who guide and encourage me at every opportunity.
PSI-MI has been endorsed by the protein informatics community as a standard XML data exchange format for protein-protein interaction datasets. While many public databases support the standard, there is a degree of heterogeneity in the way the proposed XML schema is interpreted and instantiated by different data providers. Analysis of schema instantiation in large collections of XML data is a challenging task that is unsupported by existing tools. In this study we use DescribeX, a novel visualization technique of (semi-)structured XML formats, to quantitatively and qualitatively analyze PSI-MI XML collections at the instance level with the goal of gaining insights about schema usage and to study specific questions such as: adequacy of controlled vocabularies, detection of common instance patterns, and evolution of different data collections. Our analysis shows DescribeX enhances understanding the instance-level structure of PSI-MI data sources and is a useful tool for standards designers, software developers, and PSI-MI data providers.
Reza Samavi, Mariano Consens, Shahan Khatchadourian, Thodoros Topaloglou. Exploring PSI-MI XML Collections Using DescribeX. Journal of Integrative Bioinformatics, 4(3):70, 2007. Online Journal: link
Precis: Weimin He, Leonidas Fegaras, and David Levine – “Indexing and Searching XML Documents based on Content and Structure Synopses”
August 16, 2007
Posted by shahan in precis.
Weimin He, Leonidas Fegaras, and David Levine
“Indexing and Searching XML Documents based on Content and Structure Synopses”
BNCOD 2007, Glasgow, July 2007
Information retrieval from XML data is usually performed by creating an inverted index for each text-containing element. If path constraints are to be supported, the structure of the XML documents must also be maintained. Allowing full-text search of XML documents with the ability to return individual elements tends to generate very large indexes, which increases both space and time costs.
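To make the cost concrete, here is a minimal sketch of an element-level inverted index with path information. It is my own illustration, not the data structure from the paper: the function names (`build_element_index`, `search`), the sample documents, and the choice to store one `(doc_id, path)` posting per term are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(docs):
    """Map each term to the set of (doc_id, element_path) pairs containing it."""
    index = defaultdict(set)

    def visit(elem, path, doc_id):
        here = f"{path}/{elem.tag}"
        # Index only the text directly inside this element
        # (tail text after child elements is ignored to keep the sketch short).
        for term in re.findall(r"\w+", (elem.text or "").lower()):
            index[term].add((doc_id, here))
        for child in elem:
            visit(child, here, doc_id)

    for doc_id, xml_text in docs.items():
        visit(ET.fromstring(xml_text), "", doc_id)
    return index


def search(index, term, path_suffix=""):
    """Return postings for `term` whose element path ends with `path_suffix`."""
    postings = index.get(term.lower(), set())
    return {p for p in postings if p[1].endswith(path_suffix)}


# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": "<article><title>XML indexing</title>"
          "<body><p>Bloom filters for XML</p></body></article>",
    "d2": "<article><title>Relational storage</title>"
          "<body><p>XML shredding</p></body></article>",
}
index = build_element_index(docs)
title_hits = search(index, "xml", "/title")
all_hits = search(index, "xml")
```

Every term occurrence carries a full path, so even this toy index stores one posting per (term, element) pair. That growth is the problem the synopses are meant to address.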
Various containment filtering methods are compared. These are applied against bit matrix data structures representing the different aspects involved in information retrieval from XML collections: XML document structures (Structural Summary), term location (Content Synopses), and path relations (Positional Filters).
An interesting use of two-dimensional Bloom filters is described, applied to XML document collections. The authors also demonstrate optimizations of their algorithms using a hash-based containment filter as well as a novel two-phase containment filter.
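For readers unfamiliar with Bloom filters, the sketch below shows the basic idea of using one as a containment pre-filter: one filter per document, queried to discard documents that cannot contain all the search terms. This is a plain one-dimensional Bloom filter, not the two-dimensional variant from the paper, and the class, the parameters (`num_bits`, `num_hashes`), and the sample terms are assumptions of mine.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def candidate_docs(filters, terms):
    """Keep only documents whose filter admits every query term."""
    return [
        doc_id
        for doc_id, bloom in filters.items()
        if all(bloom.might_contain(term) for term in terms)
    ]


# Hypothetical per-document term sets, for illustration only.
doc_terms = {
    "d1": ["bloom", "filters", "xml", "indexing"],
    "d2": ["relational", "storage", "shredding"],
}
filters = {}
for doc_id, terms in doc_terms.items():
    filters[doc_id] = BloomFilter()
    for term in terms:
        filters[doc_id].add(term)

candidates = candidate_docs(filters, ["bloom", "xml"])
```

A document that really contains every query term is always kept. A document that does not may occasionally slip through as a false positive, so the surviving candidates still need to be checked against the real index.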
The well-described experiments show that the novel combination of methods reduces space and time costs considerably. However, challenges such as the effect of having to split the index into multiple indexes because of hardware limitations are not mentioned. Additionally, a test against a second collection is mentioned but not described, and the claim that Berkeley DB is a relational database is incorrect. Overall, the paper is very difficult to understand because it lacks clear graphics and tends to describe the more complex process before the basic reasoning behind it. Other minor annoyances are the lack of an initial full document graph, overuse of the word “basically”, and a change of the running query near the end of the paper.