Precis: Sebastian Michel, Peter Triantafillou, Gerhard Weikum – “KLEE: A Framework for Distributed Top-k Query Algorithms” August 11, 2007Posted by shahan in precis.
Sebastian Michel, Peter Triantafillou, Gerhard Weikum
“KLEE: A Framework for Distributed Top-k Query Algorithms”
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
The family of threshold algorithms offer efficient processing of top-k queries in information retrieval involving text and semi-structured data. Previous work have shown that in order for a distributed top-k algorithm to be efficient, it must limit the number of communication phases between peers.
To achieve this goal while providing a worthwhile balance between the execution cost and the search result accuracy, KLEE combines the known cost-management techniques of random accesses, network latency, and score approximations with two summarizing data structures: HistogramBlooms and Candidate List Filters.
Each peer contains undivided term index lists and their corresponding HistogramBlooms data structure, a bloom filter-based histogram describing an index list’s score distribution. This compact representation provides the query initiating peer with the average scores for documents not yet encountered, thus reducing bandwidth transfer and improving score approximation in the top-k score calculation. Optionally, the initiating peer may use optimization by retrieving a bitmap of each peer’s candidate list to create a Candidate List Filter Matrix and can then request the definite scores only of documents that appear on a relevant number of candidate lists. KLEE provides an efficient distributed top-k algorithm in two or three communication phases by first retrieving the HistogramBlooms from each of the peers, then optionally estimating the benefit of constructing a Candidate List Filter Matrix, and finally retrieving the candidate list with the actual scores.
The results show dramatic improvement over existing methods, with controllable error ratios, reduced bandwidth, and response times. The configurability of KLEE, achieved by adjusting the number of bits used for the HistogramBlooms and Candidate Filters and its independence from the score distributions of the source document collection, the number of terms/peers, and the term correlations, all contribute to its success.