XML Structural Summaries and Microformats October 31, 2007. Posted by shahan in eclipse plugin, information retrieval, search engines, software architecture, software development, visualization, XML.
From my experiences attempting to integrate microformats into XML structural summaries, the results have all been workarounds.
Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with this approach; while the additional information embedded in the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.
Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats the page contains. A workaround for creating an XML structural summary based on microformats is to extend the XML element model so that attributes, and furthermore their values, are indexed as well (the values are needed to tell differing attributes apart). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats, but only with advance planning on the user’s part.
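To make the workaround a bit more concrete, here is a minimal sketch of a path summary whose labels include ‘class’ attribute values, not just element names. It is not the AxPRE formalism and not how DescribeX does it; the `label` and `summarize` helpers, the `tag[class=token]` label notation and the sample markup are all my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical hCalendar-like fragment, used only for illustration.
SAMPLE = """<html><body>
<div class="vevent">
  <span class="summary">CASCON</span>
  <abbr class="dtstart" title="2007-10-22">Oct 22</abbr>
</div>
<p class="note">unrelated</p>
</body></html>"""


def label(elem):
    """Element name extended with one [class=token] part per class token."""
    tokens = sorted(elem.get("class", "").split())
    return elem.tag + "".join(f"[class={t}]" for t in tokens)


def summarize(root):
    """Count how many elements share each root-to-element label path."""
    counts = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{label(elem)}"
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return counts


summary = summarize(ET.fromstring(SAMPLE))
for path, n in sorted(summary.items()):
    print(n, path)
```

With plain element names, the `div` holding the event would be indistinguishable from any other `div`; adding the attribute value to the label is what separates them. The catch is the same one described above: you have to decide in advance that ‘class’ is the attribute worth indexing.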
The screenshot below is of DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, XML attributes and their associated values. On the right-hand side you can see that a query has been entered using Lucene’s wildcard pattern ‘*event*’ to search for ‘class’ attributes that contain that term. The vertices in red represent the elements which contain it. While it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.
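The same kind of query can be imitated outside of Lucene to show why the highlighted vertices need careful reading. This is only a sketch: it uses Python’s `fnmatch` for the wildcard match (not Lucene’s query parser), and the `match_class` helper and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase


def match_class(root, pattern):
    """Return elements having a class token that matches the wildcard pattern."""
    hits = []
    for elem in root.iter():  # document order
        tokens = elem.get("class", "").split()
        if any(fnmatchcase(token, pattern) for token in tokens):
            hits.append(elem)
    return hits


# Hypothetical markup: one real hCalendar event plus an unrelated sidebar link.
doc = ET.fromstring(
    '<div class="vevent"><span class="summary">Demo</span>'
    '<div class="sidebar"><a class="eventlink">more</a></div></div>'
)

hits = match_class(doc, "*event*")
matched_classes = [elem.get("class") for elem in hits]
print(matched_classes)
```

Here the pattern picks up both the `vevent` container and an `eventlink` anchor that has nothing to do with hCalendar, and the `sidebar` element sits below the matched `vevent` without being an event property. A wildcard hit on a ‘class’ value says where the term occurs, not what the surrounding subtree means.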
Alternative Search Engines October 13, 2007. Posted by shahan in Uncategorized.
Tags: information retrieval, search engines, semi-structured information
In response to WebWorkerDaily’s article, none of the search engines listed include retrieval using structured information. Although I’m involved with information retrieval as part of my research, I don’t spend a lot of time exploring the search engines “out there”. The only reason I can give is that they haven’t done for me what Google already does with a little bit of query creativity. While searching news or blogs may have the benefit of limited scope, there’s no demonstration of added benefit.
A consequence of limiting search to a niche is that the popular terms within that niche become “boosted” automatically without being subsumed, e.g., by a larger news service or a certain wiki. Another is that the rate of re-crawling already indexed pages can be better managed. I’ll make it a point to explore whether these search engines examine markup on the page when crawling, though this is unlikely.
Currently my research efforts in information retrieval are over semi-structured document collections. Within our group we have been experimenting with boosts to certain structural elements, and although our efforts have yielded only slight improvements in the result rankings, there are a number of other tests to be run which I anticipate will reveal better boosting factors. The boosts we have experimented with thus far have excluded subelement content lengths and are calculated as: sum, log(sum), 1/log(sum), avg, and no boosting. The boosting is based on a Markov Chain Model developed for Structural Relevance by Sadek Ali and shows great promise in using summaries.
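For readers who want to see the five variants side by side, here is a rough sketch. It assumes each structural element comes with a list of positive per-occurrence weights; what those weights actually are in our experiments is not shown here, and the `boost` function and its scheme names are hypothetical, not the group’s code.

```python
import math


def boost(weights, scheme):
    """Compute a boost factor for one structural element.

    `weights` is an assumed list of positive per-occurrence weights;
    `scheme` is one of: none, sum, avg, log, inv_log.
    """
    if scheme == "none":
        return 1.0  # no boosting: leave the score unchanged
    if not weights:
        raise ValueError("at least one weight is required")
    total = sum(weights)
    if scheme == "sum":
        return total
    if scheme == "avg":
        return total / len(weights)
    if scheme in ("log", "inv_log"):
        if total <= 0:
            raise ValueError("log(sum) needs a positive sum")
        log_total = math.log(total)
        if scheme == "log":
            return log_total
        if log_total == 0:
            raise ValueError("1/log(sum) is undefined when sum == 1")
        return 1.0 / log_total
    raise ValueError(f"unknown scheme: {scheme}")


weights = [2.0, 4.0, 2.0]
for scheme in ("none", "sum", "avg", "log", "inv_log"):
    print(scheme, boost(weights, scheme))
```

Note that log(sum) and 1/log(sum) behave badly when the sum is at or below 1 (zero, negative or undefined values), so the sketch raises an error in the undefined cases instead of guessing a fallback.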
Improving Blog Traffic October 11, 2007. Posted by shahan in Uncategorized.
As a relatively new blogger, I’ve often wondered how I want to portray my writings and have begun to make it a higher priority over the last few weeks. One of the best things about blogging is that it is a way to hold myself accountable publicly. I’m listing a few questions and their answers for what I see VannevarVision to be.
What am I blogging about?
internet, information retrieval, online social networks, some eclipse programming
Who is my audience?
researchers or those interested in the more technical details of the topics listed
Do I want readers to keep coming back?
of course, I think I have interesting things to say
What is my target post rate?
currently at least once a week; I will bring this up to once a day.
Most Importantly… What is my motivation?
I have a voice, I have a pretty good idea of what I’m talking about, I will make a change somewhere that will affect readers like you. I have valuable experiences to draw from and I’d like to be remembered amongst the archives 100 years down the road when someone is digging through trying to piece my biography together to determine what kind of foods I ate, not to mention how many beers I drank. It’d be nice in the future for my kids when they’re looking through the old-school internet and see that I was serious about my work.
nothing like the present, I don’t need my forebrain smacked in the form of a wakeup call
Tags: cascon, conference, describex, ibm, vistopk
I’m happy to say that two projects that I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTop-k, both of which are supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22 – 25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plugin skills have increased a tremendous amount. As a student I’m also looking forward to the food 😉