Precis: Neoklis Polyzotis, Minos Garofalakis – “Statistical Synopses for Graph-Structured XML Databases” August 11, 2007Posted by shahan in precis.
Neoklis Polyzotis, Minos Garofalakis
“Statistical Synopses for Graph-Structured XML Databases”
ACM SIGMOD 2002 June 4-6, Madison, Wisonsin, USA
The optimization of pattern-specific queries depends on estimations of twig existence in the original XML graph structure. Many current synopses provide a concise interpretation of the XML data graph rather than a quantitative metric useful for the optimization of path evaluations. XSketch synopses provide a limited space estimation framework for efficient compile-time selectivity estimates for path expressions over XML data graphs.
An XSketch synopsis is a graph of nodes grouped by label, containing only extent sizes. Each node represents a set of paths from the original XML data graph, and each edge in the synopsis is described either as a backward-stable or a forward-stable edge relation, or neither. Calculation of selectivity is based on the Chain Rule from probability theory, which itself requires four key assumptions involving structural paths: Path Independence, Backward-Edge Uniformity, Branch-Independence, and Forward-Edge Uniformity. The construction of an XSketch synopsis is NP-hard and thus relies on a greedy, forward-selection paradigm. It begins with a label-split graph and proceeds using the marginal gains from three local refinement operations: b-stabilize, f-stabilize, and b-split. The average absolute relative error between the graph counts estimated from the XSketch synopses and those from the real XML data provides a metric a metric for the evaluation of quality.
The experimental results show that XSketch synopses are 1% to 5% the size of a B/F-bisimilar summary, which represents the collections complete graph, while maintaining an effective error rate of 10% for complex path expressions and an error rate as low as 3% for simple path expression. While there is a potentially useful tradeoff between the synopsis size and selectivity error rates, the path and branching independence means that the structure of the XML data graph is relatively random, which is generally not the case with collections of text (e.g., a section title is usually followed by a paragraph). The paper is challenging to read due to its technicality but rewarding, leading to an understanding of the authors’ novel, concise, and well-organized methods for applying statistics and probability theory to the structure of XML data graphs.