Precis: Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum – “A Time Machine for Text Search”
August 2, 2007. Posted by shahan in precis.
Klaus Berberich, Srikanta Bedathur, Thomas Neumann, Gerhard Weikum
“A Time Machine for Text Search”
SIGIR ’07, July 23-27, 2007, Amsterdam, The Netherlands
There is little research on information retrieval from multi-version documents as of a particular time. Related work exists on efficient data structures for time-range queries; however, the application of those structures to information retrieval tasks such as document scoring has not been examined closely. Efficient data structures for multi-version text documents that can score and rank documents as of a particular time would be useful for searching web archives, for example to examine how a wiki has evolved.
By adding time-frame information to each entry in a token’s posting list, a Time-Travel Inverted File Index extends existing scoring measures so that a query can return documents ranked as of a particular time. To reduce the size of an index built over multiple versions, the authors developed two methods: Temporal Coalescing and Sublist Materialization. Temporal Coalescing allows the postings of a term over a contiguous set of timeframes to be approximated within an arbitrary error bound; Sublist Materialization allows the posting list for only the queried timeframes to be regenerated from the temporally coalesced inverted index.
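To make the idea concrete, here is a minimal sketch of a posting that carries a validity interval, a greedy coalescing pass, and an "as of time t" lookup. It is an illustration under assumptions, not the authors' implementation: the names `Posting`, `coalesce`, and `as_of`, the half-open interval convention, and the choice of the midpoint of the feasible range as the representative score are all hypothetical. Adjacent postings of the same document are merged only while a single score exists that stays within the relative error bound `eps` of every original score in the run.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Posting:
    doc_id: int
    score: float
    begin: int  # first timestamp at which the score is valid (inclusive)
    end: int    # timestamp at which the score stops being valid (exclusive)


def coalesce(postings: List[Posting], eps: float) -> List[Posting]:
    """Greedily merge temporally adjacent postings of one document.

    Assumption: a run may be merged while some representative score r
    satisfies |r - s| <= eps * s for every original score s in the run.
    """
    merged: List[Posting] = []
    run: Optional[list] = None  # [doc_id, begin, end, lo, hi]
    for p in sorted(postings, key=lambda q: (q.doc_id, q.begin)):
        p_lo, p_hi = (1 - eps) * p.score, (1 + eps) * p.score
        if run and run[0] == p.doc_id and run[2] == p.begin:
            # Intersect the feasible range of the run with that of p.
            lo, hi = max(run[3], p_lo), min(run[4], p_hi)
            if lo <= hi:
                run[2], run[3], run[4] = p.end, lo, hi
                continue
        if run:
            merged.append(Posting(run[0], (run[3] + run[4]) / 2, run[1], run[2]))
        run = [p.doc_id, p.begin, p.end, p_lo, p_hi]
    if run:
        merged.append(Posting(run[0], (run[3] + run[4]) / 2, run[1], run[2]))
    return merged


def as_of(postings: List[Posting], t: int) -> List[Posting]:
    """Return the postings valid at time t, highest score first."""
    valid = [p for p in postings if p.begin <= t < p.end]
    return sorted(valid, key=lambda p: p.score, reverse=True)


# Hypothetical posting list for a single term (made-up scores and times).
term_postings = [
    Posting(1, 1.000, 0, 10),
    Posting(1, 1.005, 10, 20),  # close enough to merge with the previous one
    Posting(1, 2.000, 20, 30),  # too different, starts a new run
    Posting(2, 1.500, 5, 25),
]
compact = coalesce(term_postings, eps=0.01)
ranking_at_12 = [p.doc_id for p in as_of(compact, 12)]
```

In this sketch the lookup scans the whole list; restricting the scan to the queried timeframe is the kind of saving that materialized sublists are meant to provide.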
With a relative error bound of 0.01, temporal coalescing reduced the average index size by approximately 85%. Queries run against the collection with the same error bound achieved a recall of at least 95%. These experimental results demonstrate the effectiveness of the authors' techniques.
While the research is beneficial, the paper is slightly difficult to read because it lacks detailed graphics. In several places the textual descriptions are ambiguous, and more than one term is used before it is defined later in the paper. Moreover, some wording is poorly chosen, such as the phrase “and demand [some property]” when explaining a formula.