
XML Structural Summaries and Microformats October 31, 2007

Posted by shahan in eclipse plugin, information retrieval, search engines, software architecture, software development, visualization, XML.

From my experiences attempting to integrate microformats into XML structural summaries, the results have all been workarounds.

Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with this approach; while the additional information embedded in the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.

Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats it contains. As a workaround, an XML structural summary based on microformats can be created by extending the XML element model to index attributes and, furthermore, their values (in order to distinguish attributes that carry different values). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats, but only with advance planning on the part of the user.
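
To make the workaround concrete, here is a minimal sketch of indexing attributes and their values as extra steps in a label path. It is not how DescribeX or AxPREs are implemented; the `label_paths` helper, the `@name=value` path notation, and the sample XHTML are assumptions made for illustration. The only hCalendar facts relied on are the `vevent`, `summary` and `dtstart` class names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical page: one hCalendar event plus an unrelated block that
# happens to reuse the "summary" class name.
XHTML = """<html><body>
  <div class="vevent">
    <span class="summary">Tech Showcase</span>
    <abbr class="dtstart" title="2007-10-22">Oct 22</abbr>
  </div>
  <div class="sidebar"><span class="summary">Unrelated teaser</span></div>
</body></html>"""


def label_paths(elem, prefix=""):
    """Yield one label path per element and one per attribute-value token."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for name, value in elem.attrib.items():
        # A class attribute holds whitespace-separated tokens.
        for token in value.split():
            yield f"{path}/@{name}={token}"
    for child in elem:
        yield from label_paths(child, path)


root = ET.fromstring(XHTML)
summary = Counter(label_paths(root))
for path, count in sorted(summary.items()):
    print(count, path)
```

In this sketch both `span` elements collapse into the same `/html/body/div/span/@class=summary` entry, even though only one of them belongs to an event. That is the kind of ambiguity that requires planning the summary ahead of time.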

The screenshot below shows DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, XML attributes and their associated values. On the right-hand side you can see that a query has been entered using Lucene’s wildcard pattern ‘*event*’ to search for ‘class’ attributes that contain that term. The vertices in red represent the elements that contain it. While it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.

Microformat highlighting using DescribeX


Search Standards and OpenID; not only for single sign-on, will search standards emerge? October 31, 2007

Posted by shahan in online social networks, search engines, software architecture, standards.

OpenID can be the answer to a whole slew of online profile questions. Not only can it answer, “how can I sign on to all these sites using my existing profile?”, it offers the possibility of answering, “How can I search this website using my existing preferences?”.
OpenID is a single sign-on architecture, originally developed by Brad Fitzpatrick of LiveJournal with JanRain among its early contributors, which enables users to use an existing account supporting OpenID to access other websites that also support OpenID, thereby removing the need to create separate accounts on each site. It is a secure method for passing account details from one site to the other and differs from a password manager (either software or online) that hosts your different usernames and passwords for each site. Because your profile is stored and represented online, you can reuse your existing information quickly and easily.

Despite Stefan Brands’ in-depth analysis of the problems that may arise with OpenID, OpenID is a good solution, not only because of the ease of authentication but also because it’s a secure way of storing a profile online. WordPress has OpenID by default (more info here). With the number of search engines emerging that do different things with different methods, I predict the rise of search standards and profiles.

A simple definition of Search Standard: The method and the properties which enable a user to search content.

These can cover search-engine relevant properties (which can be translated into accepted user-preferences) like:

  • sources, e.g., blogs, news, static webpages
  • metric ranges, e.g., > 80% precision or recall
  • content creation date
  • last indexed or updated
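
As a rough illustration of how the properties above could travel with a user as a profile, here is a small sketch. No such standard exists in the post; `SearchProfile`, `Result`, `matches`, every field name, and the example URLs are hypothetical, and the idea that an engine advertises a precision figure is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class SearchProfile:
    """Hypothetical user preferences a search service could accept."""
    sources: set = field(default_factory=lambda: {"blogs", "news"})
    min_precision: float = 0.8            # metric range, e.g. > 80% precision
    created_after: Optional[date] = None  # content creation date
    indexed_after: Optional[date] = None  # last indexed or updated


@dataclass
class Result:
    url: str
    source: str
    engine_precision: float  # assumed to be advertised by the engine
    created: date
    indexed: date


def matches(profile: SearchProfile, r: Result) -> bool:
    """Return True when a result satisfies every preference in the profile."""
    if r.source not in profile.sources:
        return False
    if r.engine_precision <= profile.min_precision:
        return False
    if profile.created_after and r.created < profile.created_after:
        return False
    if profile.indexed_after and r.indexed < profile.indexed_after:
        return False
    return True


profile = SearchProfile(created_after=date(2007, 1, 1))
results = [
    Result("http://example.org/a", "blogs", 0.90, date(2007, 10, 1), date(2007, 10, 30)),
    Result("http://example.org/b", "static", 0.95, date(2007, 9, 1), date(2007, 10, 30)),
    Result("http://example.org/c", "news", 0.60, date(2007, 8, 1), date(2007, 10, 30)),
]
kept = [r.url for r in results if matches(profile, r)]
print(kept)
```

Only the first result survives here: the second comes from a source outside the profile and the third falls below the precision threshold.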

This only opens the door to many areas in search engines and their associated user preferences. Such standards change the role of the search engine from handling the interface and presentation to the user to acting as a web service (an actual engine) that can be exploited by combining it with other search engines. Such preferences address one of the biggest concerns when dealing with users: understanding and identifying what they prefer. As the number of search engines increases, the search engine market will no longer be as horizontal as it has been, but will become more hierarchical as each engine specializes in its niche. Combinations of search parameters may prove beneficial as the amount and variety of content increase, further encouraging the divergent expression of users on the web.

Alternative Search Engines October 13, 2007

Posted by shahan in Uncategorized.

In response to WebWorkerDaily’s article, none of the search engines listed includes retrieval using structured information. Although I’m involved with information retrieval as part of my research, I don’t spend a lot of time exploring the search engines “out there”. The only reason I can give is that they haven’t done for me what Google already does with a little bit of query creativity. While searching news or blogs may have the benefit of limited scope, there’s no demonstration of added benefit.

A consequence of limiting search to a niche is that the popular terms within that niche become “boosted” automatically without being subsumed, e.g., by a larger news service or a certain wiki. Another is that the rate of re-crawling already indexed pages can be better managed. I’ll make it a point to explore whether these search engines examine markup on the page when crawling, though this is unlikely.

Currently my research efforts in information retrieval focus on semi-structured document collections. Within our group we have been experimenting with boosts to certain structural elements, and although our efforts have yielded slight improvements in the result rankings, there are a number of other tests to be run which I anticipate will reveal better boosting factors. The boosts we have experimented with thus far exclude subelement content lengths and are calculated as: sum, log(sum), 1/log(sum), avg, and no boosting. The boosting is based on a Markov Chain Model developed for Structural Relevance by Sadek Ali and shows great promise in using summaries.
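
The five boost variants can be laid side by side in a few lines. This is only a sketch: the post does not say which quantities are summed, so the `values` list stands in for hypothetical per-element numbers, the natural logarithm is an assumption, and `boost_variants` is a made-up helper name.

```python
import math


def boost_variants(values):
    """Compute the candidate boost factors from a list of numbers.

    values: hypothetical per-element quantities (what exactly is summed
    is not stated in the post).
    """
    nan = float("nan")
    total = sum(values)
    log_total = math.log(total) if total > 0 else nan
    return {
        "sum": total,
        "log(sum)": log_total,
        # 1/log(sum) is undefined when sum == 1 or sum <= 0.
        "1/log(sum)": 1.0 / log_total if total > 0 and total != 1 else nan,
        "avg": total / len(values) if values else nan,
        "none": 1.0,  # no boosting leaves the score unchanged
    }


boosts = boost_variants([2.0, 3.0, 5.0])
for name, factor in boosts.items():
    print(f"{name:>10}: {factor:.4f}")
```

Note how differently the variants scale: sum grows without bound, log(sum) damps it, and 1/log(sum) inverts the ordering so that larger totals receive a smaller boost.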

Improving Blog Traffic October 11, 2007

Posted by shahan in Uncategorized.

As a relatively new blogger, I’ve often wondered how I want to portray my writings and have begun to make it a higher priority over the last few weeks. One of the best things about blogging is that it is a way to hold myself accountable publicly. I’m listing a few questions and their answers for what I see VannevarVision to be.

What am I blogging about?

internet, information retrieval, online social networks, some eclipse programming

Who is my audience?

researchers or those interested in the more technical details of the topics listed

Do I want readers to keep coming back?

of course, I think I have interesting things to say

What is my target post rate?

currently at least once a week; I will bring this up to once a day.

Most Importantly… What is my motivation?

I have a voice, I have a pretty good idea of what I’m talking about, I will make a change somewhere that will affect readers like you. I have valuable experiences to draw from and I’d like to be remembered amongst the archives 100 years down the road when someone is digging through trying to piece my biography together to determine what kind of foods I ate, not to mention how many beers I drank. It’d be nice in the future for my kids when they’re looking through the old-school internet and see that I was serious about my work.

Why Now?

no time like the present; I don’t need my forebrain smacked in the form of a wakeup call

The Structure of Information Networks October 11, 2007

Posted by shahan in Uncategorized.

Jon Kleinberg is teaching a course, The Structure of Information Networks, with an interesting reading list, some of which overlaps with the required readings of Online Social Networks taught by Stefan Saroiu. Jon Kleinberg will also be giving a talk as part of U of T’s Distinguished Lecture Series on Oct 30 11AM at the Bahen Centre, Rm 1180. Other lectures are available here.

Reading List: Online Social Networks October 11, 2007

Posted by shahan in Uncategorized.

I’m duplicating the list of papers required for the Online Social Networks course. I’m no longer in the course but will continue to follow the material. The presentation I prepared on Measurement and Analysis of Online Social Networks by Mislove et al. is attached.

* The Structure and Function of Complex Networks, M. E. J. Newman. SIAM Review 45, 167-256 (2003).
* Analysis of Topological Characteristics of Huge Online Social Networking Services, Y-Y Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. World Wide Web 2007 (WWW ’07).
* Measurement and Analysis of Online Social Networks, A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, S. Bhattacharjee. Internet Measurement Conference (IMC) 2007.
* Exploiting Social Networks for Internet Search, A. Mislove, K. P. Gummadi, and P. Druschel. HotNets 2006.
* Identity and Search in Social Networks, D. J. Watts, P. S. Dodds, M. E. J. Newman. Science 296(5571), 2002.
* On Six Degrees of Separation in DBLP-DB and More, E. Elmacioglu and D. Lee. Sigmod Record 2005.
* A Survey and Comparison of Peer-to-Peer Overlay Network Schemes, E. K. Lua, J. Crowcroft, M. Pias, R. Sharma, S. Lim. IEEE Communications Surveys and Tutorials 7(2), 2005.
* SkipNet: A Scalable Overlay Network with Practical Locality Properties, N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, A. Wolman. Usenix Symposium on Internet Technologies and Systems (USITS) 2003.
* The Impact of DHT Routing Geometry on Resilience and Proximity, K. P. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, I. Stoica. Sigcomm 2003.
* The Sybil Attack, J. R. Douceur, IPTPS 2002.
* Defending against Eclipse Attacks on Overlay Networks, A. Singh, M. Castro, P. Druschel, A. Rowstron. Sigops 2004.
* SybilGuard: Defending Against Sybil Attacks via Social Networks, H. Yu, M. Kaminsky, P. B. Gibbons, A. Flaxman. Sigcomm 2006.
* The Strength of Weak Ties, M. S. Granovetter. The American Journal of Sociology 1973.
* BubbleRap: Forwarding in small world DTNs in ever decreasing circles, P. Hui and J. Crowcroft. University of Cambridge Tech Report #UCAM-CL-TR-684 2007.
* Exploiting Social Interactions in Mobile Systems, A. G. Miklas, K. K. Gollu, K. K. W. Chan, S. Saroiu, K. P. Gummadi, E. de Lara. Ubicomp 2007.
* RE: Reliable Email, S. Garriss, M. Kaminsky, M. J. Freedman, B. Karp, D. Mazieres. Symposium on Networked Systems Design and Implementation (NSDI) 2006.
* Efficient Private Techniques for Verifying Social Proximity, M. J. Freedman and A. Nicolosi. IPTPS 2007.
* Separating key management from file system security, D. Mazieres, M. Kaminsky, M. F. Kaashoek, E. Witchel. Symposium on Operating Systems Principles (SOSP) 1999.
* Decentralized User Authentication in a Global File System, M. Kaminsky, G. Savvides, D. Mazieres, M. F. Kaashoek. Symposium on Operating Systems Principles (SOSP) 2003.
* HomeViews: Peer-to-Peer Middleware for Personal Data Sharing Applications, R. Geambasu, M. Balazinska, S. D. Gribble, and H. M. Levy. Sigmod 2007.

Demonstrating DescribeX and VisTopK at IBM CASCON Technology Showcase 2007 October 4, 2007

Posted by shahan in conference, eclipse plugin, GEF, information retrieval, visualization, XML.

I’m happy to say that two projects that I work on, DescribeX (a team effort with Sadek Ali and Flavio Rizzolo) and VisTop-k, both of which are supervised by Dr. Mariano Consens, will be demonstrated at IBM’s CASCON Technology Showcase on October 22 – 25, 2007. There were quite a few interesting projects last year and I’m looking forward to seeing what new ideas have arisen, especially since my Eclipse plugin skills have increased a tremendous amount. As a student I’m also looking forward to the food 😉

Link to CASCON

Review: Exploiting Social Networks for Internet Search October 4, 2007

Posted by shahan in online social networks.

Exploiting Social Networks for Internet Search, A. Mislove, K. P. Gummadi, and P. Druschel. HotNets 2006.

Of the three papers to read this week, this was by far the most interesting. Not only is it pertinent to my field of information retrieval, but it is the only one to derive results by conducting a real-world experiment. This article, which discusses a more social reason for the success of the authors’ social network search method, is in pleasant contrast to the previous required reading from Mislove et al., Measurement and Analysis of Online Social Networks, in which they conduct a battery of statistical analyses.

The focus of this paper is on the use of cached results from a connected group of individuals during their search for information. The authors demonstrate a 9% increase in the effectiveness of search results and attribute this to 3 reasons: disambiguation, ranking, and serendipity.

The paper encourages a deeper look into how large a “cluster” should be to exploit such advances in search effectiveness. In the paper’s experiment the groups were relatively close, and it will be a challenge to discern groups on a larger scale, especially since, as described by Watts, 2 or 3 dimensionally independent categories are most effective in determining social relatedness. Unfortunately, privacy is a very important issue and will most likely be the biggest stumbling block to putting this system into practice. This alone is a major challenge: determining what level of social relatedness will allow someone to access a network tie’s previous searches. One possible solution is for a specialized group to offer their previous searches on a paying basis, thus becoming similar to a Google Answers system on a larger scale, a cognizant expert system if you will.

Review: On Six Degrees of Separation in DBLP-DB and More October 4, 2007

Posted by shahan in online social networks.

Ergin Elmacioglu, Dongwon Lee, On Six Degrees of Separation in DBLP-DB and More. Sigmod Record 2005.

The authors present a standard analysis of co-authorship within the database research community from DBLP and other select venues. The collaboration network is represented by author nodes, two of which are adjacent if they co-authored one or more papers. The authors find scale-free power-law distributions in several of the statistics, such as the number of collaborators per author and the number of papers per author.
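
For readers unfamiliar with this kind of analysis, here is a minimal sketch of building a collaboration network and tallying the two statistics mentioned. The `papers` mapping and the author names are made up; real input would come from DBLP records, and this is not the code used in the paper.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical paper -> author lists.
papers = {
    "p1": ["Ann", "Bob", "Cho"],
    "p2": ["Ann", "Bob"],
    "p3": ["Cho", "Dee"],
    "p4": ["Eve"],
}

collaborators = defaultdict(set)
papers_per_author = Counter()
for authors in papers.values():
    for a in authors:
        papers_per_author[a] += 1
        collaborators[a]  # touch the key so solo authors appear as isolated nodes
    # Two authors are adjacent if they share at least one paper.
    for a, b in combinations(sorted(set(authors)), 2):
        collaborators[a].add(b)
        collaborators[b].add(a)

degree = {a: len(n) for a, n in collaborators.items()}
degree_histogram = Counter(degree.values())
print(degree)
print(sorted(degree_histogram.items()))
```

On a real data set, plotting `degree_histogram` on log-log axes is the usual first check for a power-law shape: a roughly straight line suggests one, although it does not prove it.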

The paper is well-written grammatically. While the authors offer several explanations early on as to why the graphs follow these trends, such explanations are missing further into the paper. Overall, the paper seems simply to fill space with statistics: it is missing the motivation for the work, lacks a brief outline of its structure, and does not inspire the reader with a description of future work.

Although I am still a budding researcher, I find that the “publish or perish” pressure cited as a factor in the increase in collaborations is not well supported. In my opinion, a more reasonable explanation may be that the amount of work required to research new advanced database systems requires more authors due to each person’s specialization. Along the same lines, more interesting work can be brought about through the combination of these specializations. One measure which caught my eye was the betweenness of a node, which was most likely feasible to compute because the graph is smaller than Cyworld or LiveJournal. It would certainly be interesting and useful to see this measure applied to larger systems.
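
To show why graph size matters for betweenness, here is a sketch using Brandes' algorithm, which runs one breadth-first search per node and therefore costs on the order of nodes times edges on an unweighted graph. The paper may well have computed the measure differently; the `betweenness` function and the toy `path` graph are illustrative assumptions.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph maps each node to an iterable of its neighbours.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}


path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(betweenness(path))
```

On the four-node path the two interior nodes each sit on two shortest paths between other pairs, while the endpoints sit on none.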

Review: Identity and Search in Social Networks October 4, 2007

Posted by shahan in online social networks.

Identity and Search in Social Networks, D. J. Watts, P. S. Dodds, M. E. J. Newman. Science 296(5571), 2002.

The authors provide a means to predict the length of a message chain when sending a message from a source to a target without complete topological information, i.e., using only local information. Taking a mathematical approach, they describe how users select their peers based on several independent dimensions, for example, geographical location or profession. A hierarchical group structure is one configuration which can be used to determine relatedness: peers who are closely related are separated by a shorter distance, measured through their lowest common ancestor in the hierarchy.
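
The distance idea can be sketched in a few lines. This is one reading of the model, not the authors' code: each person belongs to one leaf group per dimension, the distance within a dimension is the height of the lowest common ancestor (1 for members of the same group), and the overall social distance is the smallest distance over all dimensions. The people, group labels, `friends` lists and function names are all hypothetical.

```python
def hierarchy_distance(group_a, group_b):
    """Height of the lowest common ancestor of two leaf groups.

    Groups are root-to-leaf tuples of equal length; members of the same
    group are at distance 1.
    """
    shared = 0
    for x, y in zip(group_a, group_b):
        if x != y:
            break
        shared += 1
    return len(group_a) - shared + 1


def social_distance(person_a, person_b):
    """Smallest hierarchy distance over all dimensions."""
    return min(hierarchy_distance(person_a[d], person_b[d]) for d in person_a)


people = {
    "ann": {"geo": ("CA", "ON", "Toronto"), "job": ("science", "cs", "ir")},
    "bob": {"geo": ("CA", "ON", "Ottawa"), "job": ("arts", "music", "jazz")},
    "cho": {"geo": ("US", "NY", "Ithaca"), "job": ("science", "cs", "ir")},
    "dee": {"geo": ("US", "NY", "Ithaca"), "job": ("arts", "film", "docs")},
}
friends = {"ann": ["bob", "cho"], "bob": ["ann"], "cho": ["ann", "dee"], "dee": ["cho"]}


def next_hop(current, target):
    """Greedy step: pick the acquaintance socially closest to the target."""
    return min(friends[current],
               key=lambda f: social_distance(people[f], people[target]))


print(social_distance(people["ann"], people["cho"]))
print(next_hop("ann", "dee"))
```

Here ann and cho are far apart geographically but share a profession, so they count as close. When ann needs to reach dee she forwards to cho, who shares dee's city, using only what she knows about her own acquaintances.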

Watts et al. have developed a formula with tuning parameters based on real-world experiments dating back to Milgram’s experiment, which gave rise to the concept of six degrees of separation. They have convinced themselves that, because the examined networks have a structure that yields message chain lengths consistent with Milgram’s original experiment, their formula satisfies the required confidence tests. One caveat they raise is that as the number of dimensions increases, describing the relationships becomes more difficult because the correlation between network ties decreases. However, the main benefit of this algorithm is that forwarding a message closer to its destination based on a decision over 2 or 3 categories is effective, as has been shown empirically.

The algorithm is reported to be robust for a branching factor between 5 and 10; however, I would describe this as being sensitive to the right degree. I would say that robustness has to do with how the algorithm performs when part of the network fails, a case they have not addressed. Recent experiments from Mislove et al. have shown that removing the top 10% of highly connected nodes breaks the graph into millions of small clusters; however, the formula put forth in this paper uses an r value, the minimum probability that a message will reach its target, of 0.05, an arbitrary value which may not apply in today’s networks.