Precis: Weimin He, Leonidas Fegaras, and David Levine – “Indexing and Searching XML Documents based on Content and Structure Synopses”
August 16, 2007. Posted by shahan in precis.
Weimin He, Leonidas Fegaras, and David Levine
“Indexing and Searching XML Documents based on Content and Structure Synopses”
BNCOD 2007, Glasgow, July 2007
Information retrieval from XML data is usually performed by creating an inverted index for each text-containing element. If path constraints are to be supported, the structure of the XML documents must also be maintained. Allowing full-text search of XML documents with the ability to return individual elements tends to generate very large indexes, which increases both space and time costs.
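To make the per-element indexing concrete, here is a minimal sketch of an inverted index that records, for each term, the paths of the elements whose own text contains it. It is not the paper's index; the function name `build_element_index`, the path notation, and the sample document are assumptions for illustration, and tail text after child elements is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(xml_text):
    """Map each lowercase term to the set of element paths whose own text contains it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)

    def visit(elem, path):
        here = f"{path}/{elem.tag}"
        # Index only the text directly inside this element (tail text is skipped).
        for term in (elem.text or "").lower().split():
            index[term].add(here)
        for child in elem:
            visit(child, here)

    visit(root, "")
    return index


# Hypothetical sample document, not taken from the paper.
doc = "<article><title>XML indexing</title><body><p>Bloom filters for XML</p></body></article>"
index = build_element_index(doc)
print(sorted(index["xml"]))
```

Even in this toy form, every term carries a list of element paths, which hints at why element-level indexes grow quickly on large collections.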
Various containment filtering methods are compared. These are applied to bit-matrix data structures that represent the different aspects of information retrieval from XML collections: XML document structures (Structural Summary), term location (Content Synopses), and path relations (Positional Filters).
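The general idea behind containment filtering on bit vectors can be sketched as follows. This is an assumed simplification, not the paper's encoding: document positions are mapped to a fixed number of buckets, one bit per bucket, and a bitwise AND tests whether a term may fall inside an element. The names `to_bits` and `may_contain`, the document length, and the bucket count are hypothetical.

```python
def to_bits(positions, doc_len, width):
    """Set one bit per bucket that contains at least one of the given positions."""
    bits = 0
    for pos in positions:
        bits |= 1 << (pos * width // doc_len)
    return bits


def may_contain(element_bits, term_bits):
    """Approximate test: True if the element and the term share at least one bucket."""
    return (element_bits & term_bits) != 0


DOC_LEN, WIDTH = 100, 8  # assumed document length and number of buckets

element = to_bits(range(10, 30), DOC_LEN, WIDTH)   # element spans positions 10..29
term_inside = to_bits([25], DOC_LEN, WIDTH)        # term occurs inside the element
term_far = to_bits([90], DOC_LEN, WIDTH)           # term occurs far outside it
term_near = to_bits([5], DOC_LEN, WIDTH)           # outside, but in a shared bucket

print(may_contain(element, term_inside))
print(may_contain(element, term_far))
print(may_contain(element, term_near))
```

The last case shows the trade-off of such a filter: it never misses a true containment, but coarse buckets can report a match that a later exact check has to reject.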
An interesting use of two-dimensional Bloom filters is described, applied to XML document collections. The authors also demonstrate optimizations of their algorithms using a hash-based containment filter and a novel two-phase containment filter.
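As a rough illustration of how a Bloom-style filter can gain a second dimension, the sketch below hashes a term to a row and maps a document position to a column. This layout is an assumption for illustration only and may differ from the paper's definition; the class name `TermPositionFilter` and its parameters are hypothetical.

```python
import hashlib


class TermPositionFilter:
    """Bit matrix: rows are chosen by hashing the term, columns by bucketing the position."""

    def __init__(self, rows=64, cols=16, doc_len=1000):
        self.rows, self.cols, self.doc_len = rows, cols, doc_len
        self.matrix = [0] * rows  # each row is an int used as a bit vector of `cols` bits

    def _row(self, term):
        digest = hashlib.sha256(term.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.rows

    def _col(self, pos):
        return pos * self.cols // self.doc_len

    def add(self, term, pos):
        """Record that `term` occurs at position `pos` (0 <= pos < doc_len)."""
        self.matrix[self._row(term)] |= 1 << self._col(pos)

    def may_occur_in(self, term, start, end):
        """True if `term` may occur at some position in [start, end]; never a false negative."""
        lo, hi = self._col(start), self._col(end)
        mask = ((1 << (hi - lo + 1)) - 1) << lo
        return (self.matrix[self._row(term)] & mask) != 0


f = TermPositionFilter()
f.add("xml", 120)
print(f.may_occur_in("xml", 100, 200))
print(f.may_occur_in("xml", 500, 600))
```

As with any Bloom filter, two terms can hash to the same row, so a positive answer only means "possibly present" and still needs confirmation against the real data.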
The well-described experiments show that the novel combination of methods reduces space and time costs considerably. However, challenges such as the effect of splitting the index into multiple indexes because of hardware limitations are not mentioned. Additionally, a test against a second collection is mentioned but not ascribed, and the claim that Berkeley DB is a relational database is incorrect. Overall, the paper is very difficult to understand because it lacks clear graphics and tends to describe the more complex process before giving the basic reasons for it. Other minor annoyances are the lack of an initial full document graph, overuse of the word “basically”, and a change of the running query near the end of the paper.