jump to navigation

Search Standards and OpenID; not only for single sign-on, will search standards emerge? October 31, 2007

Posted by shahan in online social networks, search engines, software architecture, standards.
Tags: , ,
1 comment so far

OpenID can be the answer to a whole slew of online profile questions. Not only can it answer, “how can I sign on to all these sites using my existing profile?”, it offers the possibility of answering, “How can I search this website using my existing preferences?”.
OpenID is a single sign on architecture created by Janrain which enables users to use an existing account supporting OpenID to access other websites that also support OpenID, thereby removing the need to create separate accounts on each site. It is a secure method for passing account details from one site to the other and differs from a password manager (either software or online) that hosts your different usernames and passwords for each site. Allowing your profile to be stored and represented online, you have the ability to use your existing information quickly and easily.

Despite Stefan Brands’ in-depth analysis of the problems that may arise with OpenID, OpenID is a good solution. Not only because of the ease of authentication, but also because it’s a secure way of storing a profile online. WordPress has OpenID by default (more info here). With the number of search engines emerging that do different things with different methods, I predict the rise of search standards and profiles.

A simple definition of Search Standard: The method and the properties which enable a user to search content.

These can cover search-engine relevant properties (which can be translated into accepted user-preferences) like:

  • sources, e.g., blogs, news, static webpages
  • metric ranges, e.g., > 80% precision or recall
  • content creation date
  • last indexed or updated

This is only opening the door to many areas in search engines and associated user preferences. By having these standards, it modifies the role of the search engine from dealing with the interface and presentation to the user, to that of a web service (an actual engine) which can be exploited by combining it with other search engines. By having these preferences, it addresses one of the biggest concerns when dealing with users, understanding and identifying what they prefer. As the number of search engines increases, the search engine market will no longer be as horizontal as it has been, but will become more hierarchical as each specializes in its niche. Combinations of search parameters may prove to be beneficial as the number and type of content increases, further encouraging the divergent expression of users on the web.


The Structure of Information Networks October 11, 2007

Posted by shahan in Uncategorized.
Tags: ,
add a comment

Jon Kleinberg is teaching a course, The Structure of Information Networks, with an interesting reading list, some of which overlaps with the required readings of Online Social Networks taught by Stefan Saroiu. Jon Kleinberg will also be giving a talk as part of U of T’s Distinguished Lecture Series on Oct 30 11AM at the Bahen Centre, Rm 1180. Other lectures are available here.

Reading List: Online Social Networks October 11, 2007

Posted by shahan in Uncategorized.

I’m duplicating the list of papers required for the Online Social Networks course. I’m no longer in the course but will continue to follow the material. The presentation I prepared on the Measurement and Analysis of Online Social Networks Presentation by Mislove et al. is attached.

* The Structure and Function of Complex Networks, M. E. J. Newman. SIAM Review 45, 167-256 (2003).
* Analysis of Topological Characteristics of Huge Online Social Networking Services, Y-Y Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. World Wide Web 2007 (WWW ’07).
* Measurement and Analysis of Online Social Networks, A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, S. Bhattacharjee. Internet Measurement Conference (IMC) 2007.
* Exploiting Social Networks for Internet Search, A. Mislove, K. P. Gummadi, and P. Druschel. HotNets 2006.
* Identity and Search in Social Networks, D. J. Watts, P. S. Dodds, M. E. J. Newman. Science 269(5571), 2002.
* On Six Degrees of Separation in DBLP-DB and More, E. Elmacioglu and D. Lee. Sigmod Record 2005.
* A Survey and Comparison of Peer-to-Peer Overlay Network Schemes, E. K. Lua, J. Crowcroft, M. Pias, R. Sharma, S. Linn. IEEE Communications Surveys and Tutorials 7(2005).
* SkipNet: A Scalable Overlay Network with Practical Locality Properties, N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, A. Wolman. Usenix Symposium on Internet Technologies and Systems (USITS) 2003.
* The Impact of DHT Routing Geometry on Resilience and Proximity, K. P. Gummadi, R. Gummadi, S. D. Gribble, S. Ratnasamy, S. Shenker, I. Stoica. Sigcomm 2003.
* The Sybil Attack, J. R. Douceur, IPTPS 2002.
* Defending against Eclipse Attacks on Overlay Networks, A. Singh, M. Castro, P. Druschel, A. Rowstron. Sigops 2004.
* SybilGuard: Defending Against Sybil Attacks via Social Networks H. Yu, M. Kaminsky, P. B. Gibbons, A. Flaxman. Sigcomm 2006.
* Strength of Weak Ties, M. S. Granovetter. The American Journal of Sociology 1973.
* BubbleRap: Forwarding in small world DTNs in ever decreasing circles, P. Hui and J. Crowcroft. University of Cambridge Tech Report #UCAM-CL-TR-684 2007.
* Exploiting Social Interactions in Mobile Systems, A. G. Miklas, K. K. Gollu, K. K. W. Chan, S. Saroiu, K. P. Gummadi, E. de Lara. Ubicomp 2007.
* RE: Reliable Email, S. Garriss, M. Kaminsky, M. J. Freedman, B. Karp, D. Mazieres. Symposium on Networked Systems Design and Implementation (NSDI) 2006.
* Efficient Private Techniques for Verifying Social Proximity, M. J. Freedman and A. Nicolosi. IPTPS 2007.
* Separating key management from file system security, D. Mazieres, M. Kaminsky, M. F. Kaashoek, E. Witchel. Symposium on Operating Systems Principles (SOSP) 1999.
* Decentralized User Authentication in a Global File System., M. Kaminsky, G. Savvides, D. Mazieres, M. F. Kaashoek. Symposium on Operating Systems Principles (SOSP) 2003.
* HomeViews: Peer-to-Peer Middleware for Personal Data Sharing Applications, R. Geambasu, M. Balazinska, S. D. Gribble, and H. M. Levy. Sigmod 2007.

Review: Exploiting Social Networks for Internet Search October 4, 2007

Posted by shahan in online social networks.
1 comment so far

Exploiting Social Networks for Internet Search, A. Mislove, K. P. Gummadi, and P. Druschel. HotNets 2006.

Of the three papers to read this week, this was by far the most interesting. Not only is it pertinent to my field of information retrieval, but it is the only one to derive results by conducting a real-world experiment. This article, which discusses a more social reason for the success in their social network search method, is in pleasant contrast to the previous required reading from Mislove et al., Measurement and Analysis of Online Social Networks, in which they conduct a battery of statistical analyses.

The focus of this paper is in the use of cached results from a connected group of individuals during their search for information. The authors demonstrate a 9% increase in the effectiveness of search results and attribute this to 3 reasons: disambiguation, ranking, and serendipity.

The paper encourages a deeper look into how large a “cluster” should be to exploit such advances in search effectiveness. In the paper’s experiment, the groups were relatively close and it will be a challenge to be able to discern groups on a larger scale especially since, as was described by Watts, 2 or 3 dimensionally independent categories are most effective in determining social relatedness. Unfortunately, the question of privacy is a very important issue and will most likely be the biggest stumbling block of putting this system into practice. This alone is a major challenge: to determine what level of social relatedness will allow someone to access a network tie’s previous searches. One possible solution is for a specialized group to offer their previous searches on a paying basis, thus becoming similar to a Google Answers system on a larger scale, a cognizant expert system if you will.

Review: On Six Degrees of Separation in DBLP-DB and More October 4, 2007

Posted by shahan in online social networks.
add a comment

Ergin Elmacioglu, Dongwon Lee, On Six Degrees of Separation in DBLP-DB and More. Sigmod Record 2005.

The authors present a standard analysis of co-authorship within the database research community from DBLP and other select venues. The collaboration network is represented by the author nodes that are incident if they co-authored one or more papers. The authors find a scale-free power law distribution in several of the statistics such as number of collaborators per author and the number of papers per author.

The paper is well-written grammatically. While several explanations are offered by the authors as to why the graphs follow these trends, explanations are missing further into the paper. On the other hand, the paper seems simply to fill the space with statistics and is missing the motivation for the work, is lacking a brief structure of the paper, and does not inspire the reader with a description of future work.

Although I am still a budding researcher, the “publish or perish” pressure stated to be a factor in the increase in collaborations is not well-supported. In my opinion, a more reasonable explanation may be that the amount of work required to research the new advanced database systems requires more authors due to each person’s specialization. Along the same lines, more interesting work can be brought about through the combination of these specializations. One measure which caught my eye was the betweenness of a node, which was possible most likely due to the smaller graph size compared to Cyworld or LiveJournal. It would certainly be interesting and useful to see this measure applied to larger systems.

Review: Identity and Search in Social Networks October 4, 2007

Posted by shahan in online social networks.
add a comment

Identity and Search in Social Networks, D. J. Watts, P. S. Dodds, M. E. J. Newman. Science 269(5571), 2002.

The authors have provided a means to predict the length of a message chain when attempting to send a message from a source to a target without having complete topological information, i.e. using only local information. With the intent of being mathematical in nature, they describe how users select their peers based on several independent dimensions, for example, geographical location or profession. The hierarchical structure of group structure is one configuration which can be used to determine relatedness. Peers who are closely related will have a shorter distance based on their lowest common ancestor.

Watts et al. have developed a formula with tuning parameters based on real world experiments dating from Milgram’s experiment that gave rise to the concept of six degrees of separation. They have convinced themselves that because the examined networks have a structure that conforms to a message length amenable to Milgram’s original experiment, then their formula satisfies the required confidence tests. One caveat they raise is that as the number of dimensions increases, the ability to describe the relationships becomes more difficult due to the decrease in correlation between network ties. However; the main benefit of this algorithm is that forwarding a message using an algorithm which brings the message closer to the destination based on a decision of 2 or 3 categories is effective and has been empirically shown.

The algorithm is indicated to be robust for a branching factor between 5 and 10 however I would describe this as being sensitive to the right degree. I would say that robustness has to do with how the algorithm performs in light of failure of part of the network, a case which they have not indicated. Recent experiments from Mislove et al. have shown that removal of the top 10% highly-connected nodes disconnects the graph into millions of small clusters; however, the formula put forth in this paper has an r value, a least probability that a message will reach its target, of 0.05, an arbitrary value which may not apply in today’s networks.

Review: Measurement and Analysis of Online Social Networks October 4, 2007

Posted by shahan in online social networks.
add a comment

Measurement and Analysis of Online Social Networks, A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, S. Bhattacharjee. Internet Measurement Conference (IMC) 2007.

An in-depth study of online social network graphs is performed for Flickr, LiveJournal, Orkut, and YouTube. Typical graph properties are indicated and compared amongst each other such as: power-law coefficients, degree correlation, link symmetry, paths, and fringe clusters, as well as some brief analysis on groups. The authors indicate that this information will be useful to enhance existing and develop new applications and algorithms by comparing them to the web. There is a lack of practical application of the results however with little or no future direction indicated.

Amongst the multitude of values presented, I found the link symmetry to be one of the most interesting portions of the paper. While the authors speculate that core nodes with a high degree can be useful in the transmission of information, they can also be detrimental in the case of spam or viruses. On the web, PageRank considers pages with many incoming links and few outgoing links to be authoritative and a source of information. Conversely, pages with many outgoing links with few incoming links are considered active and are not sources of information. Using this type of model allows PageRank to effectively identify pages that contain useful information; however, due to link symmetry in an online social network, this does not apply.

A suggestion would be to examine the destination of a link request such that if user A request a connection to user B first, then user B can be deemed more important. This would be useful in the case of YouTube, where the average indegree of friends connected to nodes of high outdegree is low, thus the “celebrity-driven” nature of the content (Figure 6). Considering the link destination as being more valuable is one way to offset the symmetric nature of links. This model is akin to voting for someone when you request a link to them, your vote being acknowledged through link reciprocation. In the case of lack of temporal information, we can infer more important nodes as we know the resultant outdegree versus average indegree of friends graph . A benefit of having the temporal information is also to quash the potential infiltrator who wishes to spread a virus, thus instead of allowing a node to become part of a hub by having many links, they would be trusted only if many other nodes have a desire to connect to them first.

Despite several well-explained graphs of differences due to snowball sampling or link caps, there is a lack of conclusion in the paper. It sets the stage very well for the exploration of questions and answers, with the data itself available for download, for better solutions for online social network-based trust and information retrieval techniques.

Review: Analysis of Topological Characteristics of Huge Online Social Networking Services October 4, 2007

Posted by shahan in online social networks.
add a comment

Analysis of Topological Characteristics of Huge Online Social Networking Services, Y-Y Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. World Wide Web 2007 (WWW ’07).

The authors present the main characteristics and evolution of online social networks while providing supporting research on how representative a sample is. Complete data from Cyworld is compared and contrasted with samples from MySpace and Orkut to show that while certain properties are similar to that of offline social networks, interesting properties are revealed which are unique to online social networks.

One of the most important concerns in selecting a sample is validating how closely it represents the complete network. In the case of Orkut, the authors found a power law exponent of 3.7; however, this contradicts the exponent of 1.5 found by Mislove et al. Furthermore, each paper has differing references for the effect of snowball sampling on the power-law exponent; Mislove et al state “collecting samples via the snowball method has been shown to underestimate the power-law coefficient [1]” while conversely Ahn et al state “It is known that the power-law nature in the degree distribution is well conserved under snowball sampling [2]”. Having small sample sizes of 100,000 nodes further exacerbates this issue.

Interestingly, the historical analysis reveals the multi-scaled degree-distribution emerging in mid 2004. While the authors were unable to accurately indicate the reason for this, an Aug 2005 report from SK Telecommunications (Cyworld’s owner) website [3] reveals that Mobile Cyworld was released around March 2004. Additionally, there is inconclusive proof of this using alexa.com’s pageview statistics of cyworld.com, showing a sudden drop in late 2004, possibly contributed to Mobile Cyworld. Furthermore, a previous community portal site, Freechal, began to charge for membership around the same time period, encouraging users to switch to the more interactive Cyworld, ironically, with probably higher cost due to the addictive nature of acorns [4].

While the data and the research methodology presented in the paper may not be completely accurate, being one of the first attempts to provide metrics for online social networks places a high value on this work.

[1] L. Becchetti et al, A Comparison of Sampling Techniques for Web Graph Characterization. In Proceedings of the Workshop on Link Analysis, (LinkKDD’06), Philadelphia, PA, Aug 2006.

[2] S. H. Lee et al, Statistical properties of sampled networks. Phys. Rev. E, 73:016102, 2006.

[3] http://www.sktelecom.com/eng/jsp/tlounge/presscenter/PressCenterDetail.jsp?f_reportdata_seq=3668, SKT introduces WAP version of Mobile Cyworld (accessible only through search feature on homepage).

[4] http://english.ohmynews.com/articleview/article_view.asp?no=179108&rel_no=1