jump to navigation

XML Structural Summaries and Microformats October 31, 2007

Posted by shahan in eclipse plugin, information retrieval, search engines, software architecture, software development, visualization, XML.
add a comment

From my experiences attempting to integrate microformats into XML structural summaries, the results have all been workarounds.

Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with doing this and while the additional information embedded into the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.

Microformats are not easily retrieved from a page because there is no way to know ahead of time what formats are integrated into the page. A workaround in creating an XML structural summary based on microformats can be obtained by applying an extension of the XML element model to indexing attributes and furthermore their values (in order to identify differing attributes). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats but with advanced planning on the user.

The screenshot below is of DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text token, XML elements, XML attributes and their associatd values. On the right-hand side you can see a query has been entered searching using Lucene’s default regex ‘*event*’ to search for ‘class’ attributes that contain that term. The vertices in red represent the elements which contain it and while it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, it is not the case.

Microformat highlighting using DescribeX

Search Standards and OpenID; not only for single sign-on, will search standards emerge? October 31, 2007

Posted by shahan in online social networks, search engines, software architecture, standards.
Tags: , ,
1 comment so far

OpenID can be the answer to a whole slew of online profile questions. Not only can it answer, “how can I sign on to all these sites using my existing profile?”, it offers the possibility of answering, “How can I search this website using my existing preferences?”.
OpenID is a single sign on architecture created by Janrain which enables users to use an existing account supporting OpenID to access other websites that also support OpenID, thereby removing the need to create separate accounts on each site. It is a secure method for passing account details from one site to the other and differs from a password manager (either software or online) that hosts your different usernames and passwords for each site. Allowing your profile to be stored and represented online, you have the ability to use your existing information quickly and easily.

Despite Stefan Brands’ in-depth analysis of the problems that may arise with OpenID, OpenID is a good solution. Not only because of the ease of authentication, but also because it’s a secure way of storing a profile online. WordPress has OpenID by default (more info here). With the number of search engines emerging that do different things with different methods, I predict the rise of search standards and profiles.

A simple definition of Search Standard: The method and the properties which enable a user to search content.

These can cover search-engine relevant properties (which can be translated into accepted user-preferences) like:

  • sources, e.g., blogs, news, static webpages
  • metric ranges, e.g., > 80% precision or recall
  • content creation date
  • last indexed or updated

This is only opening the door to many areas in search engines and associated user preferences. By having these standards, it modifies the role of the search engine from dealing with the interface and presentation to the user, to that of a web service (an actual engine) which can be exploited by combining it with other search engines. By having these preferences, it addresses one of the biggest concerns when dealing with users, understanding and identifying what they prefer. As the number of search engines increases, the search engine market will no longer be as horizontal as it has been, but will become more hierarchical as each specializes in its niche. Combinations of search parameters may prove to be beneficial as the number and type of content increases, further encouraging the divergent expression of users on the web.

WordPress and Google Indexing January 11, 2007

Posted by shahan in search engines.
add a comment

I moved this blog from livejournal to WordPress around mid-November mainly due to the additional features offered such as number of viewers, trackbacks, the nicer interface, and Livejournal’s seemingly wierd-acting RSS. This was under the assumption that my readership will grow from zero, but this is only temporary which I can attribute to the lack of trackback functionality. As an experiment I did a quick search today to see whether this blog had yet made it into the Google indexes but surprisingly it had not. This is unlike Livejournal’s where my blog had been indexed by several automated feed collectors and Google had then indexed me indirectly through them. No assumptions as to why it’s like this.