Introduction to Information Retrieval (IR)
September 21, 2006
Posted by shahan in information retrieval.
It all begins with parsing a document, tokenizing it into words, discarding the words that appear on a blacklist (a stop list of common words such as ‘the’ and ‘it’), and attaching a rank (a weight) to each remaining word based on an algorithm.
There are several algorithms available, the gist of which is: if a word appears many times in a document, then it’s important to that document. The algorithms will be covered later.
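To make the first steps concrete, here is a minimal sketch in Python, assuming the simplest possible ranking: a raw count of how often each word occurs. The stop list, the regular expression used for tokenizing, and the names `STOP_WORDS`, `tokenize` and `term_frequencies` are all illustrative assumptions, not part of any particular IR system.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = {"the", "it", "a", "an", "and", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def term_frequencies(text):
    """Map each remaining term to the number of times it occurs."""
    return Counter(tokenize(text))

doc = "The cat sat on the mat. The cat is happy."
weights = term_frequencies(doc)
print(weights["cat"])  # 2, since 'cat' appears twice and is not a stop word
```

A raw count is only the crudest form of the "appears many times, so it's important" idea; it is used here because it fits in a few lines, not because it is the best choice.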
Once this information is obtained (a list of words and their importance), the words need to be linked to the documents in which they appear. For this, Inverted Indexes [Wikipedia] come in handy, though the representation described on Wikipedia is only one of a few different ways to build one. Different representations will be covered later, and there are enhancements available that provide additional services such as highlighting or taking advantage of document structure. Of course, there are tradeoffs between performance and disk space, which makes it all the more interesting.
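As a rough illustration of the idea, the sketch below builds one possible in-memory representation: a dictionary that maps each word to the documents containing it, along with a per-document count. The document ids, the stop list, and the names `build_inverted_index` and `search` are assumptions made up for this example; an on-disk index would look quite different.

```python
import re
from collections import defaultdict

# Hypothetical stop list, kept short for the example.
STOP_WORDS = {"the", "it", "a", "an", "and", "of", "to", "in", "is"}

def build_inverted_index(docs):
    """Map each term to {doc_id: count} for the documents containing it."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in STOP_WORDS:
                continue
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

def search(index, term):
    """Return ids of documents containing the term, highest count first."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=lambda d: (-postings[d], d))

docs = {
    "doc1": "The index maps words to documents.",
    "doc2": "An inverted index is an index keyed by word.",
}
index = build_inverted_index(docs)
print(search(index, "index"))  # ['doc2', 'doc1']
```

Storing only counts keeps the index small. An enhancement such as highlighting would presumably need more, for example the position of each occurrence, which is one place the performance versus disk space tradeoff shows up.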