
The value of the semantic web. RDF$? November 6, 2007

Posted by shahan in information retrieval, internet architecture, online social networks, openid, semantic web, standards, Uncategorized.

The question that this entry seeks to answer is: “Using the semantic web, what resources are available that have meaningful marketable value?”

While the value of the semantic web has been touted, its marketable value is not as widely discussed. However, for Google to be encouraged to develop an OpenRDF API, they need to see what it can do for them. In my previous post about Search Standards, I mentioned that measuring a person’s search preferences, such as the type of content to search and metric ranges, is key to improving results. Combining Greg Wilson’s post about Measurement with the value-of-data issues mentioned in Bob Warfield’s User-Contributed Data Auditing, we now want to understand how to retrieve semantically marked-up content that has the ability to generate revenue.

User-generated semantic metrics are easily achieved with the semantic web. Further, semantic metrics can be tied together using various means, one of which is mentioned in Dan Connolly’s blog entry Units of measure and property chaining. It should be noted that, due to the extensibility of semantic data, the values or metrics are independent of any specifics, which allows them to be used for trust metrics as well.

There is a general use case which describes what I mean:

  1. Content is made available. The quality is not called into question, yet.
  2. The content is semantically marked up so that it has properties that mean something.
  3. Other users mark up the content even further, but with personally relevant properties that they create themselves or take from an existing schema (e.g. one available from their employer). These properties can be associated with their online identity (OpenID) and extended to their social network through Google’s OpenSocial API.

The data has now been extended from being searchable for relevant content using existing methods to becoming searchable using user-generated value metrics. These can then be leveraged, similar to Google Coop, and with further benefit if search standards were available.
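As a rough illustration of the use case above, the sketch below stores each piece of markup as a plain tuple (a triple plus the identity of whoever asserted it) and searches by a user-generated metric. It is only a minimal sketch: the property name `ex:usefulness`, the `openid:` identifiers, and the function `search_by_metric` are all made up for this example and do not come from any RDF library or Google API.

```python
from collections import defaultdict

# Each statement is (subject, property, value, author): a triple plus the
# OpenID-style identity of whoever asserted it. All names are hypothetical.
statements = [
    ("doc:1", "dc:title", "Intro to RDF", "openid:publisher"),
    ("doc:1", "ex:usefulness", 4, "openid:alice"),
    ("doc:1", "ex:usefulness", 5, "openid:bob"),
    ("doc:2", "dc:title", "Units of measure", "openid:publisher"),
    ("doc:2", "ex:usefulness", 2, "openid:alice"),
]


def search_by_metric(statements, prop, minimum, trusted=None):
    """Return subjects whose average value for `prop` is at least `minimum`.

    If `trusted` is given, only statements made by those authors count.
    """
    values = defaultdict(list)
    for subject, p, value, author in statements:
        if p == prop and (trusted is None or author in trusted):
            values[subject].append(value)
    return sorted(s for s, vs in values.items() if sum(vs) / len(vs) >= minimum)


# Search on the metric rather than on the content itself.
print(search_by_metric(statements, "ex:usefulness", 4))
# Restrict the metric to one identity, e.g. someone in my social network.
print(search_by_metric(statements, "ex:usefulness", 2, trusted={"openid:alice"}))
```

Keeping the author next to every statement is what would let the same data double as a trust metric: the query can be limited to the identities the searcher chooses to rely on.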

If a group were selected for their ability to identify and rank relevant content not by the content itself but by the value associated with the properties of that content, the question of relevance is no longer whether the content is relevant to the person evaluating it, but whether its properties would be relevant to someone searching for those properties. This has the potential to remove bias from relevance evaluation. Content is no longer evaluated for what it is but for what it is perceived as. The metrics from paid users, as well as from users who view the content and apply their own or standard metrics, are easily expandable and searchable by others: an architecture permitting growth beyond limited views.


XML Structural Summaries and Microformats October 31, 2007

Posted by shahan in eclipse plugin, information retrieval, search engines, software architecture, software development, visualization, XML.

From my experiences attempting to integrate microformats into XML structural summaries, the results have all been workarounds.

Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with doing this; while the additional information embedded into the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.

Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats it contains. As a workaround, an XML structural summary based on microformats can be created by extending the XML element model to index attributes and, furthermore, their values (in order to tell differing attributes apart). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats, but only with advance planning on the part of the user.
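To make the workaround more concrete, here is a small sketch of a label-path summary in which each step is the element name plus the values of its ‘class’ attribute. This is not the AxPRE machinery or DescribeX itself, only a simplified stand-in written with Python’s standard library; the sample markup and the function name `summarize` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up XHTML fragment containing one hCalendar-style event.
XHTML = """<html><body>
<div class="vevent">
  <span class="summary">CASCON</span>
  <abbr class="dtstart" title="2007-10-22">Oct 22</abbr>
</div>
<div class="note"><span>unrelated</span></div>
</body></html>"""


def summarize(root, attr="class"):
    """Count label paths, where each step is the tag plus its `attr` tokens."""
    counts = Counter()

    def visit(node, prefix):
        # Split the attribute value so class="a b" and class="b a" match.
        tokens = sorted(node.get(attr, "").split())
        label = node.tag + "".join("." + t for t in tokens)
        path = prefix + "/" + label
        counts[path] += 1
        for child in node:
            visit(child, path)

    visit(root, "")
    return counts


summary = summarize(ET.fromstring(XHTML))
for path in sorted(summary):
    print(summary[path], path)
```

With the attribute values folded into the labels, `div.vevent` and `div.note` land in different partitions of the summary, whereas an element-only summary would merge them into a single `div` node.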

The screenshot below is of DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, XML attributes and their associated values. On the right-hand side you can see that a query has been entered using Lucene’s wildcard syntax, ‘*event*’, to search for ‘class’ attributes that contain that term. The vertices in red represent the elements that contain it. While it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.

Microformat highlighting using DescribeX

Demonstrating DescribeX and VisTopK at IBM CASCON Technology Showcase 2007 October 4, 2007

Posted by shahan in conference, eclipse plugin, GEF, information retrieval, visualization, XML.

I’m happy to say that two projects that I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTopK, both of which are supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22 – 25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plugin skills have increased a tremendous amount. As a student I’m also looking forward to the food 😉

Link to CASCON

VisTopK Screenshot Available January 11, 2007

Posted by shahan in eclipse, eclipse plugin, GEF, information retrieval, visualization.

Although there was a large break in between VisTopK-related posts, the project is now complete. A labelled screenshot is provided for your benefit. The report will be uploaded soon as well. I’m very happy with the result and am excited by the possibilities offered by the plug-in as it allows integration of many other projects, existing and new.

Screenshot of VisTopK


I’ve created a screencast of VisTopK in action using the great application Wink, originally referenced from Greg’s blog entry, but WordPress doesn’t allow Flash (SWF) uploads. Anyone have any ideas on how/where I can post it? I tried converting it to an AVI to maybe post it to YouTube but the size of 1GB stopped that attempt cold.

TReX and XSummary November 5, 2006

Posted by shahan in information retrieval, software development, XML.

Currently, as part of my Research Assistantship supervised by Dr. Consens, I have worked with several of the existing code bases. One is called TReX, which was used in the Initiative for the Evaluation of XML Retrieval (INEX). INEX is a global effort consisting of over 50 groups who participate by working on a common XML document collection (Wikipedia this year, IEEE articles last year). The sharing of results promotes an open research environment and also helps direct future research initiatives.

I am in the process of refactoring and separating the implementation of TReX into discrete and modular components. The refactoring process is challenging but very rewarding, as it requires understanding not only how tightly integrated the data structures are with each other, but also what the code is actually doing and why. One tool I found very helpful in understanding the system as a whole is an Eclipse plug-in called Creole. It provides the ability to visualize the Java packages, classes, and method calls, and even to view the source code, all from within a common interface with boxes and arrows. The most useful feature applied against TReX was the ability to view the building of the code through the CVS check-ins.

Further, a cross-listed course (one offered at both the undergraduate and graduate level) I took in the summer of 2006, Software Architecture and Design taught by Greg Wilson, was extremely useful, as it taught how software patterns can help prevent the problems currently facing the code. Greg taught ways of thinking about the structure of software in order to allow for effective expansion. The paper Growing a Language by Guy L. Steele Jr. is an excellent read that describes the concept of growth; while it is not directed towards software design, its concepts and notions are equally applicable. It is also a very easy read, though it may at first glance seem very strange.

A second code base which I have worked with is called XSummary, which summarizes the structure of XML document collections. It was demonstrated at the Technology Showcase at IBM’s CASCON 2006 in mid-October. It is an Eclipse plug-in developed using the Zest visualization toolkit. It presented structural summaries of various XML document collections, including Wikipedia, BPEL, and blog feeds in RSS and Atom. It depicts the parent-child relationships of XML structural elements and allows detail to be added by displaying different summaries on a particular element. XSummary is developed by Flavio Rizzolo. I also integrated a coverage and reachability model developed by Sadek Ali. Coverage and reachability provide a way of identifying which elements are considered important within the collection, with the ability to specify a range, thus simplifying the level of detail by displaying only the elements within a slider-selected range.

Top-K Results and Threshold Algorithms November 3, 2006

Posted by shahan in information retrieval.

Top-K retrieval is simply a method of requesting the top k elements from a collection, akin to asking for the top 10 sellers on eBay. A well-known family of top-k algorithms is the Threshold Algorithm (TA), which terminates when a certain threshold is achieved. The threshold in IR is based on the inverted index maintained on a particular keyword (associating a relevance value with the keyword’s location, as mentioned in a previous blog entry). The simplest example: when looking for a keyword in an inverted index R containing n entries, where k < n, simply sorting R in descending order of relevance lets us retrieve the first k entries, resulting in the top k. TAs extend this task to several lists and provide various optimizations based on the available features of the inverted indexes. One important feature is whether the lists support only sequential scans, random access, or a combination. An example of a situation where random access is not available is when the inverted index is based on a list that may change and is not directly accessible, for instance when parsing Google search results. More information on the features and optimization of TAs will be provided later.
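For the curious, below is a minimal sketch of a threshold-style top-k over several keyword lists, assuming that both sequential (sorted) access and random access are available, that scores are non-negative, and that a document’s overall score is the sum of its per-keyword scores. The function name `threshold_top_k` and the sample scores are made up for illustration; this is one textbook-style variant, not the TReX implementation.

```python
import heapq


def threshold_top_k(lists, k):
    """Top-k documents by summed score over several keyword lists.

    `lists` holds one dict per keyword, mapping document id -> score.
    A document missing from a list scores 0 for that keyword.
    """
    # Sequential access: each list ordered by descending score.
    ordered = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    max_depth = max((len(o) for o in ordered), default=0)
    seen = {}
    for depth in range(max_depth):
        last_scores = []
        for o in ordered:
            if depth < len(o):
                doc, score = o[depth]
                last_scores.append(score)
                if doc not in seen:
                    # Random access: fetch this document's score from every list.
                    seen[doc] = sum(l.get(doc, 0) for l in lists)
            else:
                last_scores.append(0)  # exhausted list contributes nothing more
        # No unseen document can score higher than the sum of the scores
        # last read from each list.
        threshold = sum(last_scores)
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            return best  # stop early: the current top k cannot be beaten
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Two keyword lists with made-up integer relevance scores.
lists = [{"a": 9, "b": 8, "c": 1}, {"b": 7, "c": 6, "a": 2}]
print(threshold_top_k(lists, 2))  # [('b', 15), ('a', 11)]
```

When random access is not available, the `l.get(doc, 0)` lookup is exactly the step that has to go, which is where the sequential-scan-only variants come in.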

Introduction to Information Retrieval (IR) September 21, 2006

Posted by shahan in information retrieval.

It all begins with parsing a document, tokenizing the words that do not appear on a blacklist (common words such as ‘the’, ‘it’), and attaching a rank to these words based on an algorithm.

There are several algorithms available, the gist of which is: if a word appears many times in a document, then it’s important. Algorithms will be covered later.

Once this information is obtained (a list of words and their importance), it needs to be linked to the documents in which the words appear. For this, Inverted Indexes [Wikipedia] come in handy, though the representation in Wikipedia is only one of a few different ways. Different representations will be covered later, and there are enhancements available which provide different services such as highlighting or taking advantage of document structure. Of course there are tradeoffs between performance and disk space, which make it all the more interesting.
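The whole pipeline described in this entry can be sketched in a few lines. This is only an illustration under simple assumptions: the blacklist is a tiny made-up set, the “rank” is nothing more than the raw count of a word in a document, and the function names are my own.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "it", "a", "of", "is", "and"}  # illustrative blacklist


def tokenize(text):
    """Lower-case the text, split it into words, and drop blacklisted words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {document id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, term):
    """Return (document id, count) pairs for `term`, most frequent first."""
    postings = index.get(term.lower(), {})
    return sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "d1": "The semantic web is a web of data",
    "d2": "It is the web",
}
index = build_inverted_index(docs)
print(search(index, "web"))  # [('d1', 2), ('d2', 1)]
```

Storing only counts keeps the index small; enhancements such as highlighting would also need the position of each occurrence, which is one place where the tradeoff between disk space and features shows up.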