
Introduction to Information Retrieval (IR) September 21, 2006

Posted by shahan in information retrieval.

It all begins with parsing a document, tokenizing it into words, discarding the words that appear on a blacklist of common words (a stop list, such as ‘the’, ‘it’), and attaching a rank (a weight) to each remaining word based on an algorithm.
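As a rough illustration of the tokenizing step, here is a minimal sketch in Python. The stop list, the function name tokenize, and the rule that a word is any run of letters or digits are all assumptions made for this example, not part of any particular IR system.

```python
import re

# Hypothetical stop list for illustration; real systems use much longer ones.
STOP_WORDS = {"the", "it", "a", "an", "and", "of", "in", "on", "to", "is"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Assumption: a token is any run of ASCII letters or digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(tokenize("The cat sat on the mat, and it purred."))
# ['cat', 'sat', 'mat', 'purred']
```

Anything smarter, such as stemming or handling of non-English text, would slot in at the same point.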

There are several algorithms available, the gist of which is: if a word appears many times in a document, then it is probably important to that document. Algorithms will be covered later.
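To make the "appears many times, so it's important" idea concrete, here is a small sketch that scores each word by its relative frequency in a document. This is only one simple possibility chosen for illustration; the post does not commit to a specific algorithm, and the function name term_frequencies is made up for this example.

```python
from collections import Counter

def term_frequencies(tokens):
    """Map each term to (occurrences of term) / (total tokens in the document)."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

weights = term_frequencies(["cat", "mat", "cat", "purred"])
print(weights["cat"])  # 0.5
print(weights["mat"])  # 0.25
```

Dividing by the document length keeps long documents from winning just because they contain more words overall.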

Once this information is obtained (a list of words and their importance), each word needs to be linked to the documents in which it appears. For this, Inverted Indexes [Wikipedia] come in handy, though the representation shown in Wikipedia is only one of a few different ways. Different representations will be covered later, along with enhancements that provide additional services such as highlighting or taking advantage of document structure. Of course, there are tradeoffs between performance and disk space, which makes it all the more interesting.
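Here is a minimal sketch of one possible inverted index representation: a dictionary from each word to the documents that contain it, with a per-document count. The layout, the function name build_inverted_index, and the toy documents are assumptions for illustration; other representations store positions, weights, or compressed posting lists instead.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Build term -> {doc_id: count} from a dict of doc_id -> list of tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            index[term][doc_id] = count
    return dict(index)

# Toy, already-tokenized documents (hypothetical data).
docs = {
    1: ["cat", "sat", "mat"],
    2: ["cat", "purred", "cat"],
}
index = build_inverted_index(docs)
print(index["cat"])  # {1: 1, 2: 2}
print(index["mat"])  # {1: 1}
```

Looking up a word is then a single dictionary access. Storing word positions instead of plain counts would support highlighting, at the cost of more disk space, which is the kind of tradeoff mentioned above.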


