jump to navigation

Precis: Wiemin He, Leonidas Fegaras, and David Levine – “Indexing and Searching XML Documents based on Content and Structure Synopses” August 16, 2007

Posted by shahan in precis.
add a comment

Wiemin He, Leonidas Fegaras, and David Levine

“Indexing and Searching XML Documents based on Content and Structure Synopses”

BNCOD 2007, Glasgow, July 2007

 

Information retrieval from XML data is usually performed by creating an inverted index for each text-containing element. If specifying path constraints is desired, the structure of the XML documents must be also be maintained. Allowing full-text searching of XML documents with the ability to return individual elements tends to generate very large indexes, which adversely affects space and time costs.

Various containment filtering methods are compared. These are applied against bit matrix data structures representing the different aspects involved in information retrieval from XML collections: XML document structures (Structural Summary), term location (Content Synopses), and path relations (Positional Filters).

An interesting use of two-dimensional Bloom filters are described in its application with XML document collections. Optimizations of their algorithms by using a hash-based, as well as, a novel two-phase containment filter are demonstrated.

The well-described experiments show that the novel combination of methods improves space and time constraints considerably. However, challenges such as the effect of multiple indexes due to hardware limitations are not mentioned. Additionally, a test against a second collection is mentioned but not ascribed, nor is the claim that BerkelyDB is a relational database correct. Overall, the paper is very difficult to understand due to the lack of clear graphics and the tendency to describe the more complex process prior to the basic reasons. Other minor annoyances involve the lack of an initial full document graph, overuse of the word “basically”, and changing the running query near the end of the paper.

Precis: Sebastian Michel, Peter Triantafillou, Gerhard Weikum – “KLEE: A Framework for Distributed Top-k Query Algorithms” August 11, 2007

Posted by shahan in precis.
add a comment

Sebastian Michel, Peter Triantafillou, Gerhard Weikum

“KLEE: A Framework for Distributed Top-k Query Algorithms”

Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

 

 

The family of threshold algorithms offer efficient processing of top-k queries in information retrieval involving text and semi-structured data. Previous work have shown that in order for a distributed top-k algorithm to be efficient, it must limit the number of communication phases between peers.

To achieve this goal while providing a worthwhile balance between the execution cost and the search result accuracy, KLEE combines the known cost-management techniques of random accesses, network latency, and score approximations with two summarizing data structures: HistogramBlooms and Candidate List Filters.

Each peer contains undivided term index lists and their corresponding HistogramBlooms data structure, a bloom filter-based histogram describing an index list’s score distribution. This compact representation provides the query initiating peer with the average scores for documents not yet encountered, thus reducing bandwidth transfer and improving score approximation in the top-k score calculation. Optionally, the initiating peer may use optimization by retrieving a bitmap of each peer’s candidate list to create a Candidate List Filter Matrix and can then request the definite scores only of documents that appear on a relevant number of candidate lists. KLEE provides an efficient distributed top-k algorithm in two or three communication phases by first retrieving the HistogramBlooms from each of the peers, then optionally estimating the benefit of constructing a Candidate List Filter Matrix, and finally retrieving the candidate list with the actual scores.

The results show dramatic improvement over existing methods, with controllable error ratios, reduced bandwidth, and response times. The configurability of KLEE, achieved by adjusting the number of bits used for the HistogramBlooms and Candidate Filters and its independence from the score distributions of the source document collection, the number of terms/peers, and the term correlations, all contribute to its success.

Precis: Neoklis Polyzotis, Minos Garofalakis – “Statistical Synopses for Graph-Structured XML Databases” August 11, 2007

Posted by shahan in precis.
add a comment

Neoklis Polyzotis, Minos Garofalakis

“Statistical Synopses for Graph-Structured XML Databases”

ACM SIGMOD 2002 June 4-6, Madison, Wisonsin, USA

 

The optimization of pattern-specific queries depends on estimations of twig existence in the original XML graph structure. Many current synopses provide a concise interpretation of the XML data graph rather than a quantitative metric useful for the optimization of path evaluations. XSketch synopses provide a limited space estimation framework for efficient compile-time selectivity estimates for path expressions over XML data graphs.

 

An XSketch synopsis is a graph of nodes grouped by label, containing only extent sizes. Each node represents a set of paths from the original XML data graph, and each edge in the synopsis is described either as a backward-stable or a forward-stable edge relation, or neither. Calculation of selectivity is based on the Chain Rule from probability theory, which itself requires four key assumptions involving structural paths: Path Independence, Backward-Edge Uniformity, Branch-Independence, and Forward-Edge Uniformity. The construction of an XSketch synopsis is NP-hard and thus relies on a greedy, forward-selection paradigm. It begins with a label-split graph and proceeds using the marginal gains from three local refinement operations: b-stabilize, f-stabilize, and b-split. The average absolute relative error between the graph counts estimated from the XSketch synopses and those from the real XML data provides a metric a metric for the evaluation of quality.

 

The experimental results show that XSketch synopses are 1% to 5% the size of a B/F-bisimilar summary, which represents the collections complete graph, while maintaining an effective error rate of 10% for complex path expressions and an error rate as low as 3% for simple path expression. While there is a potentially useful tradeoff between the synopsis size and selectivity error rates, the path and branching independence means that the structure of the XML data graph is relatively random, which is generally not the case with collections of text (e.g., a section title is usually followed by a paragraph). The paper is challenging to read due to its technicality but rewarding, leading to an understanding of the authors’ novel, concise, and well-organized methods for applying statistics and probability theory to the structure of XML data graphs.

Precis: Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum – “A Time Machine for Text Search” August 2, 2007

Posted by shahan in precis.
add a comment

Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum

“A Time Machine for Text Search”

SIGIR ’07, July 23-27, 2007, Amsterdam, The Netherlands

 

There is little research in the area of information retrieval from multi-version documents as of a particular time. There is related work in the area of efficient data structures for conducting time-range based queries; however, their application to information retrieval taskssuch as document scoring has not be examined closely. Efficient data structures relevant to information retrieval from multi-version text documents with the ability to score and rank documents as of a particular time will be useful for searching web archives, such as examining the progression of Wikis.

By including additional time-frame information in a token’s posting list, a Time-Travel Inverted File Index extends existing scoring measures to allow querying for a ranked document at a particular time. To reduce the size of the generated index over multiple versions, Temporal Coalescing and Sublist Materialization methods were developed by the authors. Temporal Coalescing allows the posting lists of terms over a contiguous set of timeframes to be approximated within arbitrary error bounds; Sublist Materialization allows the posting list of only the queried timeframes to be regenerated from the temporally coalesced inverted index.

With a relative error bound of 0.01, temporal coalescing reduced the average index size by approximately 85%, showing its effectiveness. Queries applied against the collection using this same error bound resulted in an effective recall level of at least 95%. The authors have demonstrated the effectiveness of their techniques through their experimental results.

While the research is beneficial, the paper is slightly difficult to read due to the lack of detailed graphics. In several areas, the textual descriptions are ambiguous, with more than one occurrence of a term failing to be defined until later in the paper. Moreover, some words were not well selected, such as the phrase “and demand [some property]” when explaining a formula.

Computer Science Writing Course August 2, 2007

Posted by shahan in precis.
add a comment

I’ll be posting some of the writings I’ve done as part of the Computer Science Writing Course for graduate students at the University of Toronto. The course is structured such that students, from their own area of research, write and present: six precis of articles selected by the student or their supervisor, a literature review, an analytic comparison of data, a ten-minute in-class presentation, and a poster session for invited professors.

The valuable instruction covers: English language usage; grammar; avoiding common mistakes; and improving structure and style. Greg Wilson has often suggested that students should post their critiques and writings while others digress and indicate that students’ lack of experience and knowledge may seed inaccurate information as well as incite negative impressions. I now agree that writings should be posted online, and any dissenting posts can guide the student in future work.

Follow

Get every new post delivered to your Inbox.