ICDE 2008 DescribeX Demonstration April 10, 2008
Posted by shahan in Uncategorized. Tags: automaton, automaton intersection, Brics, describex, eclipse, GEF, refinement, structural summary, XML, Zest
This post is an outline of DescribeX and its demonstration at ICDE 2008. The 4-page demonstration submission will be available soon.
UPDATE: The submission is available online here.
DescribeX is a graphical Eclipse plugin for interacting with structural summaries of XML collections. It is developed in Java using GEF, Zest (now incorporated into GEF), Brics (a Java automaton library), and Apache Lucene (a Java information retrieval library). The structural summaries are defined using an axis path regular expression (AxPRE).
Several versions have been developed, each new version allowing a different type of summary as well as different interactions with the summary.
The oldest version, originally developed for CASCON 2006, created a P* summary (or F&B-index), so the structural summary was drawn as a tree using a tree layout algorithm from GEF. The only refinement available was P*C, specified with XPath expressions evaluated against all the files in the collection. The control panel for this version is at the bottom left.
The second version allowed the creation of an A(k)-index, letting the user specify the path height k to consider when creating the summary partitions. It used Zest (now incorporated into GEF) for the layout, since a structural summary based on the A(k)-index can be a graph rather than a tree.
The third version implements true AxPRE expressions, using the Brics automaton Java library to convert the regular expression to an NFA. A label summary of the collection was created, and refinements were processed by intersecting the NFA of the regular expression with the automaton of the label summary. Zest was again used for the layout. The control panel for this version is in the middle on the left side.
The versions also differ in their extra features, such as additional filters for coverage and for highlighting the elements that match a keyword query.
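As a rough illustration of how the height k changes the partitions, here is a minimal Python sketch. It is not DescribeX code (which is Java); the function name `ak_partition` and the sample document are made up for this example. It relies on the observation that, for a single tree-shaped document, grouping elements by the last k+1 labels of their incoming path gives the same partitions as the A(k)-index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def ak_partition(xml_text, k):
    """Group elements by their own label plus the labels of up to k ancestors.

    Returns a dict mapping each label path (a tuple) to its extent size,
    i.e. the number of elements that fall into that partition.
    """
    root = ET.fromstring(xml_text)
    extents = defaultdict(int)

    def visit(elem, ancestors):
        path = ancestors + (elem.tag,)
        key = path[-(k + 1):]  # keep only the last k+1 labels
        extents[key] += 1
        for child in elem:
            visit(child, path)

    visit(root, ())
    return dict(extents)


# Hypothetical document: <c> appears under both <b> and <d>.
doc = "<a><b><c/></b><d><c/></d></a>"
print(ak_partition(doc, 0))
print(ak_partition(doc, 1))
```

With k = 0 both `c` elements fall into one partition that has two different parents (`b` and `d`), which is why the summary can be a graph rather than a tree; with k = 1 they are split into separate partitions.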
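The intersection step can be sketched with the classic product construction. The Python below is only an illustration under simplifying assumptions: both automata are deterministic and hand-written as plain dictionaries, the labels (`root`, `entry`, `name`) are invented, and none of this is the Brics API or the DescribeX implementation.

```python
from collections import deque


def intersect(a, b):
    """Product construction for two deterministic automata.

    Each automaton is a tuple (start, accepting_states, transitions),
    where transitions maps (state, symbol) -> next state.
    """
    a_start, a_accept, a_delta = a
    b_start, b_accept, b_delta = b
    start = (a_start, b_start)
    delta, accept = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in a_accept and q in b_accept:
            accept.add((p, q))
        for (state, symbol), p_next in a_delta.items():
            # Follow a symbol only if both automata can read it here.
            if state != p or (q, symbol) not in b_delta:
                continue
            nxt = (p_next, b_delta[(q, symbol)])
            delta[((p, q), symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, accept, delta


def accepts(automaton, word):
    """Run a deterministic automaton over a sequence of labels."""
    start, accept, delta = automaton
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in accept


# Hypothetical label summary: the label paths that exist in the collection.
summary = ("START", {"root", "entry", "name"},
           {("START", "root"): "root",
            ("root", "entry"): "entry",
            ("entry", "name"): "name"})

# Hypothetical path pattern: any label path that ends in "name".
pattern = (0, {1},
           {(0, "root"): 0, (0, "entry"): 0, (0, "name"): 1,
            (1, "root"): 0, (1, "entry"): 0, (1, "name"): 1})

both = intersect(pattern, summary)
print(accepts(both, ["root", "entry", "name"]))
print(accepts(both, ["root", "name"]))
```

The product accepts only label paths that both match the pattern and actually occur in the summary, which is the idea behind using the intersection to decide which summary nodes a refinement applies to.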
The key points of the demonstration are that our tool allows a user to quickly and easily determine the paths that exist in the collection, determine the importance of summary nodes, and interact with the structural summary by performing refinements. An additional aspect is the ability to highlight the elements that contain the terms of a keyword search, which relates to our participation in INEX.
The attached screenshot shows three graphs. The topmost and middle graphs are P* structural summaries (or F&B-indexes) of two protein-protein interaction (PPI) datasets conforming to the PSI-MI schema standard. These two graphs are based on the first version and show the important nodes coloured green using a coverage value of 50%, i.e. the nodes that together contain 50% of the entire collection’s total number of extents. Other coverage measures (such as a random walk coverage) are available and easy to implement. The first (topmost) dataset, HPRD, is a single 60MB XML file, while the second (middle) dataset, Intact, is a collection of 6 XML files totalling 20MB. It should be noted that these are only a small subset of the gigabyte-size collections available. We can see that the larger HPRD dataset uses a smaller structure than the Intact collection.
I obtained some very good feedback after demonstrating DescribeX to several of the attendees. Suggestions included displaying cardinalities as well as displaying the information retrieval component while using summaries. It would have been nice to show how the scoring of a document would be affected if some of the summary nodes were refined using an AxPRE to combine elements containing the search term. Next time I hope to allow the user to use the plugin to prod the product; “It’s like walking the high wire without a safety net,” as Guy Lohman put it.
Future work involves preparing a downloadable plugin for interested users. As it stands, the three versions can be made available and can work alongside each other (and actually the third version requires the first version); however, the instructions for use have not been updated in a while (though the application is easy to use). The newer version also lacks extensibility, since I would like to update the way in which the extension points for filters and coverage are implemented.
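One way to read the 50% coverage filter is as a greedy selection of the largest extents until the target share is reached. The Python sketch below assumes that reading; the selection rule, the function name `coverage_nodes`, and the extent sizes are assumptions for illustration, not the plugin's actual filter.

```python
def coverage_nodes(extent_sizes, coverage=0.5):
    """Pick summary nodes, largest extent first, until they hold at least
    `coverage` of all elements in the collection (assumed selection rule)."""
    total = sum(extent_sizes.values())
    target = coverage * total
    chosen, covered = [], 0
    ranked = sorted(extent_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for node, size in ranked:
        if covered >= target:
            break
        chosen.append(node)
        covered += size
    return chosen


# Hypothetical extent sizes for four summary nodes.
sizes = {"a": 50, "b": 30, "c": 15, "d": 5}
print(coverage_nodes(sizes, 0.5))
print(coverage_nodes(sizes, 0.8))
```

A different measure, such as the random walk coverage mentioned above, would only change how the nodes are ranked before the cut-off is applied.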
Want to comment on Tim Berners-Lee’s blog? Here’s how November 2, 2007
Posted by shahan in openid, semantic web.
It’s very easy. The Decentralized Information Group (DIG) site is where you can find a bit of information on what’s being rolled out regarding the combined use of RDF and OpenID, and it also hosts several blogs. To block spammers, commenting is protected by a combination of OpenID, RDF, and a basic trust metric. Before someone can log in to post, the person must be placed on a whitelist. You cannot create an account on the site; OpenID is used to log in. To compute the basic trust metric of being known within 2 degrees of separation (a person at DIG knows someone who knows someone), you require a FOAF file. The following is a list of the steps I took to get whitelisted:
1. WordPress provides an OpenID URL for me; it’s the address of my blog: http://vannevarvision.wordpress.com
2. I generated a FOAF file through the FOAF-a-matic.
3. I copied and pasted the generated RDF from step 2 into a text file called foaf.rdf, and added the line
<foaf:openid rdf:resource="http://vannevarvision.wordpress.com/"/>
before the line
</foaf:Person>
NOTE: this requirement may be removed in the future to use the homepage property instead of the openid property
4. I saved the file, uploaded it to my homepage, and to ensure that Apache Web Server would provide the correct content-type for the rdf file, I added the following line to my .htaccess file:
AddType application/rdf+xml rdf
5. I joined the Semantic Web Interest Group’s IRC channel, where I asked whether anyone would be kind enough to add me to their ‘knows’ list in their own FOAF properties.
6. Sean B. Palmer (sbp) and Dan Connolly (DanC) were kind enough to look at my blog and see that I don’t have spammer intentions, so Sean added me to his FOAF file, validated it, then reran the script on the blog server to add me to the whitelist.
7. I’m now able to log in to the DIG site using my OpenID URL.
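The "2 degrees of separation" trust metric behind these steps can be pictured as a bounded breadth-first search over foaf:knows links. The Python sketch below is only that picture: it is not the script that runs on the DIG server, and the `knows` graph, the example.org URLs, and the function name `whitelist` are all hypothetical.

```python
from collections import deque


def whitelist(knows, trusted_roots, max_degrees=2):
    """Return everyone reachable from the trusted roots by following
    at most `max_degrees` 'knows' links (bounded breadth-first search)."""
    distance = {root: 0 for root in trusted_roots}
    queue = deque(trusted_roots)
    while queue:
        person = queue.popleft()
        if distance[person] == max_degrees:
            continue  # do not expand beyond the allowed distance
        for friend in knows.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return set(distance)


# Hypothetical OpenID URLs and 'knows' links harvested from FOAF files.
knows = {
    "https://example.org/alice": ["https://example.org/bob"],
    "https://example.org/bob": ["https://example.org/carol"],
    "https://example.org/carol": ["https://example.org/dave"],
}
allowed = whitelist(knows, ["https://example.org/alice"])
print(sorted(allowed))
```

In this toy graph alice is the trusted member, so bob and carol are whitelisted while dave, three links away, is not. This is why being added to the FOAF file of someone already known to the group is enough.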
It was a very easy and quick process, though I had the advantage of a blog dating from last year with a few posts on XML and microformats, which are not entirely out of scope for the semantic web community. Thanks to sbp and DanC for their help.
Recommended References:
FOAF and OpenID: two great tastes that taste great together by Dan Connolly
Whitelisting blog post by Sean B. Palmer
XML Structural Summaries and Microformats October 31, 2007
Posted by shahan in XML, eclipse plugin, information retrieval, search engines, software architecture, software development, visualization.
From my experiences attempting to integrate microformats into XML structural summaries, the results have all been workarounds.
Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with doing this, and while the additional information embedded into the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.
Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats it contains. As a workaround, an XML structural summary based on microformats can be created by extending the XML element model to index attributes and, furthermore, their values (in order to distinguish differing attributes). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats, but only with advance planning on the user’s part.
The screenshot below is of DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, XML attributes and their associated values. On the right-hand side you can see that a query has been entered using Lucene’s wildcard syntax, ‘*event*’, to search for ‘class’ attributes that contain that term. The vertices in red represent the elements which contain it, and while it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.
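To make the workaround concrete, the following Python sketch folds ‘class’ attribute values into each element's label before building a simple path summary. It is an illustration only, not DescribeX code: the helper names `label` and `class_aware_paths` and the sample markup are made up, and the `tag.class` labelling scheme is an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def label(elem):
    """Element label extended with its 'class' values, e.g. 'div.vevent'."""
    classes = sorted(elem.get("class", "").split())
    return elem.tag + "".join("." + c for c in classes)


def class_aware_paths(xhtml_text):
    """Count elements per root-to-node path of class-extended labels."""
    root = ET.fromstring(xhtml_text)
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + "/" + label(elem)
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(root, "")
    return counts


# Hypothetical page fragment: one hCalendar event next to unrelated markup.
page = ('<div><div class="vevent"><span class="summary">Talk</span></div>'
        '<div><span>Other</span></div></div>')
for path, count in sorted(class_aware_paths(page).items()):
    print(path, count)
```

With the attribute values in the label, `div.vevent` and a plain `div` end up in different summary nodes, whereas a summary built on element names alone would merge them.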
Alternative Search Engines October 13, 2007
Posted by shahan in Uncategorized. Tags: information retrieval, search engines, semi-structured information
In response to WebWorkerDaily’s article, none of the search engines listed include retrieval using structured information. Although I’m involved with information retrieval as part of my research, I don’t spend a lot of time exploring the search engines “out there”. The only reason I can give is that they haven’t done for me what Google already does with a little bit of query creativity. While searching news or blogs may have the benefit of limited scope, there’s no demonstration of added benefit.
A consequence of limiting search to a niche is that the popular terms within that niche become “boosted” automatically without being subsumed, e.g., by a larger news service or a certain wiki. Another is that the rate of re-crawling already indexed pages can be better managed. I’ll make it a point to explore whether these search engines examine markup on the page when crawling, though this is unlikely.
Currently my research efforts in information retrieval are over semi-structured document collections. Within our group we have been experimenting with boosts to certain structural elements, and although our efforts have produced only slight improvements in the result rankings, there are a number of other tests to be run which I anticipate will reveal better boosting factors. The boosts that we have experimented with thus far have excluded subelement content lengths and are calculated as: sum, log(sum), 1/log(sum), avg, and no boosting. The boosting is based on a Markov Chain Model developed for Structural Relevance by Sadek Ali and shows great promise in using summaries.
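For reference, the five boosting variants listed above can be written down in a few lines of Python. The post does not say which quantity is summed or how the boost enters the score, so this sketch assumes a list of hypothetical per-element values and a multiplicative factor in which "no boosting" is 1.0; the function name `boost` and the scheme names are also invented.

```python
import math


def boost(values, scheme):
    """Boost factor for one structural element under an assumed scheme.

    `values` is a non-empty list of hypothetical per-element quantities.
    """
    if not values:
        raise ValueError("values must not be empty")
    total = sum(values)
    if scheme == "none":
        return 1.0  # assumption: no boosting means a neutral factor
    if scheme == "sum":
        return total
    if scheme == "avg":
        return total / len(values)
    if scheme in ("log", "inv_log"):
        if total <= 0 or (scheme == "inv_log" and total == 1):
            raise ValueError("log-based boosts need sum > 0 (and != 1 for 1/log)")
        return math.log(total) if scheme == "log" else 1.0 / math.log(total)
    raise ValueError("unknown scheme: " + scheme)


for scheme in ("sum", "log", "inv_log", "avg", "none"):
    print(scheme, boost([2, 4, 6], scheme))
```

Note that the log-based variants are undefined for a non-positive sum and 1/log(sum) is undefined when the sum is exactly 1, so any real experiment has to decide how to treat those cases.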
Improving Blog Traffic October 11, 2007
Posted by shahan in Uncategorized. Tags: blogging
As a relatively new blogger, I’ve often wondered how I want to portray my writings and have begun to make it a higher priority over the last few weeks. One of the best things about blogging is that it is a way to hold myself accountable publicly. I’m listing a few questions and their answers for what I see VannevarVision to be.
What am I blogging about?
internet, information retrieval, online social networks, some eclipse programming
Who is my audience?
researchers or those interested in the more technical details of the topics listed
Do I want readers to keep coming back?
of course, I think I have interesting things to say
What is my target post rate?
currently at least once a week; I will bring this up to once a day.
Most Importantly… What is my motivation?
I have a voice, I have a pretty good idea of what I’m talking about, I will make a change somewhere that will affect readers like you. I have valuable experiences to draw from and I’d like to be remembered amongst the archives 100 years down the road when someone is digging through trying to piece my biography together to determine what kind of foods I ate, not to mention how many beers I drank. It’d be nice in the future for my kids when they’re looking through the old-school internet and see that I was serious about my work.
Why Now?
nothing like the present, I don’t need my forebrain smacked in the form of a wakeup call
Demonstrating DescribeX and VisTopK at IBM CASCON Technology Showcase 2007 October 4, 2007
Posted by shahan in GEF, XML, conference, eclipse plugin, information retrieval, visualization. Tags: cascon, conference, describex, ibm, vistopk
I’m happy to say that two projects that I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTop-k, both of which are supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22 – 25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plugin skills have increased a tremendous amount. As a student I’m also looking forward to the food.
