
The value of the semantic web: RDF? November 6, 2007

Posted by shahan in information retrieval, internet architecture, online social networks, openid, semantic web, standards, Uncategorized.

This entry seeks to answer the question: “Using the semantic web, what resources are available that have meaningful marketable value?”

While the value of the semantic web has been touted, its marketable value is not as widely discussed. However, to encourage Google to develop an OpenRDF API, Google needs to see what it can do for them. In my previous post about Search Standards, I mentioned that measuring a person’s search preferences, such as the type of content to search and metric ranges, is key to improving results. Combining Greg Wilson’s post about Measurement with the value-of-data issues raised in Bob Warfield’s User-Contributed Data Auditing, we now want to understand how to retrieve semantically marked-up content that has the ability to generate revenue.

User-generated semantic metrics are easily achieved with the semantic web. Further, semantic metrics can be tied together by various means, one of which is described in Dan Connolly’s blog entry Units of measure and property chaining. It should be noted that, because semantic data is extensible, the value metrics are independent of any specifics, allowing them to be used for trust metrics as well.

There is a general use case which describes what I mean:

  1. Content is made available. The quality is not called into question, yet.
  2. The content is semantically marked up so that it has properties that mean something.
  3. Other users mark up the content even further, but with personally relevant properties, which they can create themselves or draw from an existing schema (e.g., one available from their employer). These properties can be associated with their online identity through OpenID and extended across their social network through Google’s OpenSocial API.
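
The steps above can be sketched with RDF-style triples represented as plain Python tuples; every URI and property name below is a hypothetical placeholder, and a real deployment would of course use an actual RDF store.

```python
# RDF-style triples as (subject, predicate, object) tuples.
# All URIs and property names below are invented for illustration.
triples = [
    # Steps 1-2: the content and its original semantic markup.
    ("http://example.org/article/42", "dc:title", "Intro to RDF"),
    ("http://example.org/article/42", "dc:creator", "alice"),
    # Step 3: another user adds personally relevant properties,
    # tied to their OpenID identity.
    ("http://example.org/article/42", "ex:usefulness", "0.9"),
    ("http://example.org/article/42", "ex:ratedBy", "https://bob.example.com/openid"),
]

def objects(subject, predicate):
    """Return all objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://example.org/article/42", "ex:usefulness"))  # ['0.9']
```

Because triples are schema-free, Bob’s `ex:usefulness` metric coexists with the original Dublin Core markup without any coordination, which is the extensibility point made above.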

The data has now been extended from being searchable for relevant content using existing methods to being searchable using user-generated value metrics. These metrics can then be leveraged, much as in Google Co-op, with further benefit if search standards were available.

If a group were selected for its ability to identify and rank content not by the content itself but by the value associated with the content’s properties, relevance would no longer mean whether the content is relevant to the person evaluating it, but whether its properties would be relevant to someone searching for those properties. This could remove bias from relevance evaluation: content is no longer evaluated for what it is but for what it is perceived as. The metrics from paid evaluators, together with those from users who annotate content with their own or standard metrics, remain easily expandable and searchable by others, an architecture permitting growth beyond limited views.


Search Standards and OpenID: not only for single sign-on, will search standards emerge? October 31, 2007

Posted by shahan in online social networks, search engines, software architecture, standards.

OpenID can be the answer to a whole slew of online profile questions. Not only can it answer “How can I sign on to all these sites using my existing profile?”, it offers the possibility of answering “How can I search this website using my existing preferences?”
OpenID is a single sign-on architecture (originally developed by Brad Fitzpatrick, with libraries and promotion from JanRain) which enables users to use an existing OpenID-enabled account to access other websites that also support OpenID, thereby removing the need to create separate accounts on each site. It is a secure method for passing account details from one site to another, and it differs from a password manager (software or online) that stores your different usernames and passwords for each site. By allowing your profile to be stored and represented online, it lets you use your existing information quickly and easily.

Despite Stefan Brands’ in-depth analysis of the problems that may arise with OpenID, OpenID is a good solution, not only because of the ease of authentication but also because it is a secure way of storing a profile online. WordPress supports OpenID by default (more info here). With the number of search engines emerging that do different things with different methods, I predict the rise of search standards and profiles.

A simple definition of Search Standard: The method and the properties which enable a user to search content.

These standards can cover search-engine-relevant properties (which can be translated into accepted user preferences) such as:

  • sources, e.g., blogs, news, static webpages
  • metric ranges, e.g., > 80% precision or recall
  • content creation date
  • last indexed or updated

This is only opening the door to many areas in search engines and associated user preferences. These standards would shift the role of the search engine from handling the interface and presentation to the user to acting as a web service (an actual engine) that can be exploited by combining it with other search engines. Such preferences also address one of the biggest concerns when dealing with users: understanding and identifying what they prefer. As the number of search engines increases, the search engine market will no longer be as horizontal as it has been, but will become more hierarchical as each engine specializes in its niche. Combinations of search parameters may prove beneficial as the amount and variety of content increases, further encouraging the divergent expression of users on the web.

Review: Exploiting Social Networks for Internet Search October 4, 2007

Posted by shahan in online social networks.

Exploiting Social Networks for Internet Search, A. Mislove, K. P. Gummadi, and P. Druschel. HotNets 2006.

Of the three papers to read this week, this was by far the most interesting. Not only is it pertinent to my field of information retrieval, it is also the only one to derive results from a real-world experiment. This article, which offers a more social explanation for the success of their social-network search method, is a pleasant contrast to the previously required reading from Mislove et al., Measurement and Analysis of Online Social Networks, in which they conduct a battery of statistical analyses.

The focus of this paper is the use of cached results from a connected group of individuals during their search for information. The authors demonstrate a 9% increase in the effectiveness of search results and attribute it to three factors: disambiguation, ranking, and serendipity.

The paper encourages a deeper look into how large a “cluster” should be to exploit such gains in search effectiveness. In the paper’s experiment the groups were relatively close-knit, and discerning groups on a larger scale will be a challenge, especially since, as Watts described, two or three dimensionally independent categories are most effective in determining social relatedness. Privacy, unfortunately, is a very important issue and will most likely be the biggest stumbling block to putting this system into practice. Determining what level of social relatedness should allow someone to access a network tie’s previous searches is itself a major challenge. One possible solution is for a specialized group to offer its previous searches on a paying basis, becoming something like a Google Answers system on a larger scale, a cognizant expert system if you will.

Review: On Six Degrees of Separation in DBLP-DB and More October 4, 2007

Posted by shahan in online social networks.

Ergin Elmacioglu and Dongwon Lee, On Six Degrees of Separation in DBLP-DB and More. SIGMOD Record, 2005.

The authors present a standard analysis of co-authorship within the database research community, drawn from DBLP and other select venues. The collaboration network is represented by author nodes, with an edge between any two authors who have co-authored one or more papers. The authors find a scale-free power-law distribution in several statistics, such as the number of collaborators per author and the number of papers per author.
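
The collaboration graph the authors describe can be sketched as follows, using made-up paper data; the number of collaborators per author is simply each node’s degree.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical papers, each given as its list of authors.
papers = [
    ["elmacioglu", "lee"],
    ["lee", "smith", "jones"],
    ["smith", "jones"],
]

# Authors are nodes; an edge joins two authors who share a paper.
coauthors = defaultdict(set)
for authors in papers:
    for a, b in combinations(authors, 2):
        coauthors[a].add(b)
        coauthors[b].add(a)

# Collaborators per author = degree of each node.
degree = {author: len(peers) for author, peers in coauthors.items()}
print(degree["lee"])  # 3
```

With real DBLP data, the histogram of these degree values is what the authors fit against a power law.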

Grammatically, the paper is well written. While the authors offer several explanations early on for why the graphs follow these trends, such explanations are missing further into the paper. Moreover, the paper seems simply to fill space with statistics: it omits the motivation for the work, lacks a brief outline of the paper’s structure, and does not inspire the reader with a description of future work.

Although I am still a budding researcher, I find the claim that “publish or perish” pressure is a factor in the increase in collaborations to be poorly supported. In my opinion, a more reasonable explanation may be that researching new advanced database systems requires more authors because each person specializes; along the same lines, more interesting work can come from combining these specializations. One measure that caught my eye was node betweenness, which was most likely feasible because of the smaller graph size compared to Cyworld or LiveJournal. It would certainly be interesting and useful to see this measure applied to larger systems.

Review: Identity and Search in Social Networks October 4, 2007

Posted by shahan in online social networks.

Identity and Search in Social Networks, D. J. Watts, P. S. Dodds, M. E. J. Newman. Science 296(5571), 2002.

The authors provide a means to predict the length of a message chain when sending a message from a source to a target without complete topological information, i.e., using only local information. Taking a mathematical approach, they describe how users select their peers along several independent dimensions, for example geographical location or profession. The hierarchical organization of groups is one configuration that can be used to determine relatedness: peers who are closely related have a shorter distance based on their lowest common ancestor.

Watts et al. have developed a formula with tuning parameters grounded in real-world experiments dating back to Milgram’s experiment, which gave rise to the concept of six degrees of separation. They satisfy themselves that, because the examined networks have a structure that yields message lengths consistent with Milgram’s original experiment, their formula passes the required confidence tests. One caveat they raise is that as the number of dimensions increases, describing the relationships becomes more difficult because of the decreasing correlation between network ties. However, the main benefit of this algorithm is that forwarding a message so as to bring it closer to the destination, based on a decision over two or three categories, is effective, and this has been empirically shown.
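
The hierarchical distance used in this model can be sketched as the height of two members’ lowest common ancestor in a group tree; the group hierarchy below is invented for illustration.

```python
# Each group's parent in a hypothetical organizational hierarchy;
# individuals sit in the leaf groups.
parent = {
    "db-group": "cs-dept", "ai-group": "cs-dept",
    "cs-dept": "university", "bio-dept": "university",
}

def ancestors(group):
    """Path from a group up to the root, inclusive."""
    path = [group]
    while group in parent:
        group = parent[group]
        path.append(group)
    return path

def hierarchical_distance(g1, g2):
    """Watts-style distance: height of the lowest common ancestor.
    Same group -> 1; sibling groups -> 2; and so on."""
    up1, up2 = ancestors(g1), ancestors(g2)
    for i, g in enumerate(up1):
        if g in up2:
            return i + 1  # levels climbed to reach the common ancestor
    return max(len(up1), len(up2))

print(hierarchical_distance("db-group", "ai-group"))  # siblings: 2
```

The model’s key point is that each independent dimension (geography, profession, …) gets its own such tree, and a message is forwarded to the neighbor minimizing distance in whichever dimension looks most promising.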

The algorithm is said to be robust for a branching factor between 5 and 10; however, I would describe this as sensitivity to the right degree. To my mind, robustness concerns how the algorithm performs when part of the network fails, a case they have not examined. Recent experiments from Mislove et al. have shown that removing the top 10% most highly connected nodes disconnects the graph into millions of small clusters; moreover, the formula put forth in this paper uses an r value (the least probability that a message reaches its target) of 0.05, an arbitrary value which may not apply in today’s networks.

Review: Measurement and Analysis of Online Social Networks October 4, 2007

Posted by shahan in online social networks.

Measurement and Analysis of Online Social Networks, A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, S. Bhattacharjee. Internet Measurement Conference (IMC) 2007.

An in-depth study of online social network graphs is performed for Flickr, LiveJournal, Orkut, and YouTube. Typical graph properties, such as power-law coefficients, degree correlation, link symmetry, paths, and fringe clusters, are reported and compared across the networks, along with some brief analysis of groups. The authors suggest this information will be useful for enhancing existing applications and algorithms, and for developing new ones, by comparison with the web. However, the results see little practical application, and little or no future direction is indicated.

Amongst the multitude of values presented, I found link symmetry to be one of the most interesting parts of the paper. While the authors speculate that high-degree core nodes can be useful for transmitting information, such nodes can also be detrimental in the case of spam or viruses. On the web, PageRank considers pages with many incoming links and few outgoing links to be authoritative sources of information; conversely, pages with many outgoing links and few incoming links are considered active and are not sources of information. This model lets PageRank effectively identify pages that contain useful information; however, because links in an online social network are largely symmetric, the model does not apply there.
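
The authority-versus-active distinction PageRank relies on can be seen in a toy power iteration; the four-page graph and the damping factor below are illustrative, not from the paper.

```python
# Toy directed graph: page -> pages it links to.
links = {
    "hub": ["authority"],   # many outgoing, few incoming links: "active"
    "a": ["authority"],
    "b": ["authority"],
    "authority": ["hub"],   # many incoming links: authoritative
}

def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank power iteration on a dict-of-lists graph."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for out in outs:
                new[out] += damping * rank[page] / len(outs)
        rank = new
    return rank

ranks = pagerank(links)
# The page with many incoming links accumulates rank.
print(ranks["authority"] > ranks["a"])  # True
```

In a social network where nearly every link is reciprocated, every node looks like both “hub” and “authority” at once, which is exactly why this signal washes out there.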

A suggestion would be to examine the destination of a link request: if user A requests a connection to user B first, then user B can be deemed more important. This would be useful in the case of YouTube, where the average indegree of friends connected to nodes of high outdegree is low, reflecting the “celebrity-driven” nature of the content (Figure 6). Treating the link destination as more valuable is one way to offset the symmetric nature of links; the model is akin to voting for someone when you request a link to them, your vote being acknowledged through link reciprocation. Lacking temporal information, we can still infer the more important nodes from the graph of resultant outdegree versus average indegree of friends. A further benefit of temporal information is quashing a potential infiltrator who wishes to spread a virus: instead of a node becoming part of a hub simply by accumulating many links, it would be trusted only if many other nodes desired to connect to it first.

Despite several well-explained graphs of the differences due to snowball sampling or link caps, the paper lacks a conclusion. It does, however, set the stage very well, with the data itself available for download, for exploring better solutions for online social network-based trust and information retrieval techniques.

Review: Analysis of Topological Characteristics of Huge Online Social Networking Services October 4, 2007

Posted by shahan in online social networks.

Analysis of Topological Characteristics of Huge Online Social Networking Services, Y-Y Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. World Wide Web 2007 (WWW ’07).

The authors present the main characteristics and evolution of online social networks while providing supporting research on how representative a sample is. Complete data from Cyworld is compared and contrasted with samples from MySpace and Orkut to show that while certain properties are similar to that of offline social networks, interesting properties are revealed which are unique to online social networks.

One of the most important concerns in selecting a sample is validating how closely it represents the complete network. In the case of Orkut, the authors found a power-law exponent of 3.7; however, this contradicts the exponent of 1.5 found by Mislove et al. Furthermore, the papers cite conflicting references on the effect of snowball sampling on the power-law exponent: Mislove et al. state that “collecting samples via the snowball method has been shown to underestimate the power-law coefficient [1]”, while Ahn et al. state that “It is known that the power-law nature in the degree distribution is well conserved under snowball sampling [2]”. Small sample sizes of 100,000 nodes further exacerbate the issue.
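
One way to probe such disagreements is to re-estimate the exponent directly. A sketch of the standard maximum-likelihood estimate, alpha = 1 + n / sum(ln(x_i / x_min)), applied to synthetic degrees (the data here is generated, not from either paper):

```python
import math
import random

def powerlaw_alpha(degrees, x_min=1):
    """MLE for the exponent of a power law p(x) ~ x^(-alpha), x >= x_min."""
    xs = [x for x in degrees if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

# Synthetic degrees drawn from a continuous power law with alpha = 2.5,
# via inverse-transform sampling: x = x_min * (1 - u) ** (-1 / (alpha - 1)).
random.seed(0)
sample = [(1 - random.random()) ** (-1 / 1.5) for _ in range(100_000)]
print(round(powerlaw_alpha(sample), 2))
```

The estimate depends strongly on the chosen x_min and on which nodes the crawl reached, which is one mechanical reason two studies of similar networks can report exponents as different as 1.5 and 3.7.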

Interestingly, the historical analysis reveals a multi-scaled degree distribution emerging in mid-2004. While the authors were unable to pinpoint the reason for this, an August 2005 report on the website of SK Telecom (Cyworld’s owner) [3] reveals that Mobile Cyworld was released around March 2004. Additionally, alexa.com’s pageview statistics for cyworld.com offer inconclusive evidence, showing a sudden drop in late 2004 possibly attributable to Mobile Cyworld. Furthermore, a previous community portal site, Freechal, began to charge for membership around the same time, encouraging users to switch to the more interactive Cyworld (ironically, probably at higher cost, given the addictive nature of acorns [4]).

While the data and the research methodology presented in the paper may not be completely accurate, being one of the first attempts to provide metrics for online social networks gives this work high value.

[1] L. Becchetti et al., A Comparison of Sampling Techniques for Web Graph Characterization. In Proceedings of the Workshop on Link Analysis (LinkKDD ’06), Philadelphia, PA, Aug 2006.

[2] S. H. Lee et al., Statistical Properties of Sampled Networks. Phys. Rev. E, 73:016102, 2006.

[3] http://www.sktelecom.com/eng/jsp/tlounge/presscenter/PressCenterDetail.jsp?f_reportdata_seq=3668, SKT introduces WAP version of Mobile Cyworld (accessible only through search feature on homepage).

[4] http://english.ohmynews.com/articleview/article_view.asp?no=179108&rel_no=1