
Academic CV Template March 25, 2010

Posted by shahan in Uncategorized.

After searching for longer than I would have liked, I found a good TeX/LaTeX template for academic CVs at the following link:

http://jblevins.org/projects/cv-template/

Globalive and what net-neutrality is and isn’t December 13, 2009

Posted by shahan in Uncategorized.

http://www.cbc.ca/technology/story/2009/12/11/clement-internet-access-bell-telus-mts.html

The story here is that the smaller internet providers won’t have access to the newer, faster connections that Bell and Telus set up, and that internet access is not yet deemed an essential service.

Basically, there’s a lot of brouhaha about what net-neutrality really is. Financial issues are really the side dish of the whole meal.
Net-neutrality is really about giving equal access to the material on the internet, i.e., not limiting P2P transfers or Skype calls.
The issue is not about the _physical_ connection nor about the cost of that access. The biggest earner for the internet provider is not the fastest (and therefore most expensive) connection; it’s the general user who pays for a reasonably fast connection and then simply sends email and visits a few sites. Many reports already find Canada way behind the times in its internet pricing and technology policies. In my opinion, the end result is that as more wireless providers enter the market (“more” meaning an increase from 3 national providers to 4 with Globalive’s entrance, which itself relies on Rogers’ network), internet providers will, and should, move to wireless technologies. I think this is an important step, especially for a large land mass like Canada, where the physical connections offered by Bell and Telus should take a back seat.

Facebook Continues Unconsented Invasion of Privacy December 13, 2009

Posted by shahan in Uncategorized.

http://www.cbc.ca/technology/story/2009/12/08/tech-facebook-cellphone-synchronize-privacy.html

Even though, for privacy reasons, I’m not on Facebook, there is still an inherent flaw: Facebook collects info on non-users. It happens when the Facebook application is installed on a BlackBerry, which then gives the owner the choice of storing the BlackBerry contact list on Facebook.

Also, Facebook recently upgraded its privacy controls to allow people posting content to control who sees each and every piece of it – but if you noticed, the wizard’s default is to open all the info to everyone again! I was so close to signing up, and then, poof, they try to pull another fast one. I can’t wait for a truly open-source social network solution to exist.

I also can’t wait till Google Voice launches its service in Canada, where disposable numbers and email addresses will become the norm! I really have to be a ghost in the meantime.

ICDE 2008 DescribeX Demonstration April 10, 2008

Posted by shahan in Uncategorized.

This post is an outline of DescribeX and its demonstration at ICDE 2008. The 4-page demonstration submission will be available soon.

UPDATE: The submission is available online here.

DescribeX is a graphical Eclipse plugin for interacting with structural summaries of XML collections. It is developed in Java using GEF, Zest (now incorporated into GEF), Brics (a Java automaton library), and Apache Lucene (a Java information retrieval library). The structural summaries are defined using an axis path regular expression (AxPRE).

Several versions have been developed, each new version allowing a different type of summary as well as different interactions with the summary.

The oldest version, originally developed for CASCON 2006, created a P* summary (or F&B-index) and thus represented the structural summary as a tree. A tree graph layout algorithm from GEF was used. Only a P*C refinement was available, using XPath expressions evaluated against all the files in the collection. The control panel for this version is at the bottom on the left.

The second version allowed the creation of an A(k)-index, letting the user specify the path length to consider when creating the summary partitions. This version used Zest (now incorporated into GEF) for the layout algorithm, since a structural summary based on the A(k)-index can be a graph instead of a tree.

The third version implements true AxPRE expressions, using the Brics automaton Java library to convert the regular expression to an NFA. A label summary of the collection was created, and refinements were processed by intersecting the NFA of the regular expression with the automaton of the label summary. Zest was also used for the layout algorithm. The control panel for this version is in the middle on the left side.
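To give a flavour of that intersection step, here is a minimal sketch in Java using the dk.brics.automaton API. The AxPRE is written here as a plain regular expression over single-letter edge labels and the label summary is a stand-in automaton, so this only illustrates the mechanism; it is not the actual DescribeX code.

import dk.brics.automaton.Automaton;
import dk.brics.automaton.RegExp;

public class AxpreIntersectionSketch {
    public static void main(String[] args) {
        // Hypothetical AxPRE written as a plain regular expression over edge labels;
        // the real AxPRE syntax and its translation live inside DescribeX.
        Automaton axpre = new RegExp("p(c)*").toAutomaton();

        // Stand-in for the label-summary automaton built from the collection.
        Automaton labelSummary = new RegExp("(p|c|f)*").toAutomaton();

        // Refinements are processed by intersecting the two automata;
        // an empty result means no summary node matches the refinement.
        Automaton refinement = axpre.intersection(labelSummary);
        System.out.println("Refinement is empty: " + refinement.isEmpty());
    }
}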

The remaining differences between the versions are in extra features, such as additional filters like coverage and the highlighting of elements matching a keyword query.

The key points of the demonstration are that our tool allows a user to quickly and easily determine the paths that exist in the collection, assess the importance of summary nodes, and interact with the structural summary by performing refinements. An additional aspect is the ability to highlight the elements that contain the terms of a keyword search; this relates to our participation in INEX.

The attached screenshot shows three graphs; the topmost and middle graphs are P* structural summaries (or F&B-indexes) of two protein-protein interaction (PPI) datasets conforming to the PSI-MI schema standard. These two graphs are based on the first version and show the important nodes coloured green using a coverage value of 50%, i.e., the nodes whose extents together account for 50% of the elements in the entire collection. Other coverage measures (such as a random-walk coverage) are available and easily implementable. The first (topmost) dataset, HPRD, is a single 60MB XML file, while the second (middle) dataset, Intact, is a collection of 6 XML files totalling 20MB. It should be noted that these are only a small subset of the gigabyte-sized collections available. We can see that the larger HPRD collection uses a smaller structure than the Intact collection.
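As an aside, the coverage filter can be thought of as a greedy selection over extent sizes: keep taking the summary nodes with the largest extents until the selected nodes account for the requested fraction of the collection. The sketch below assumes each node’s extent size is known; the class and method names are mine, not DescribeX’s.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CoverageFilterSketch {
    // Returns the summary nodes to highlight for a coverage value such as 0.5,
    // i.e. the largest-extent nodes that together cover that fraction of all elements.
    public static List<String> nodesToHighlight(Map<String, Integer> extentSizes, double coverage) {
        long total = extentSizes.values().stream().mapToLong(Integer::longValue).sum();
        List<Map.Entry<String, Integer>> byExtent = new ArrayList<>(extentSizes.entrySet());
        byExtent.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        List<String> selected = new ArrayList<>();
        long covered = 0;
        for (Map.Entry<String, Integer> node : byExtent) {
            if (covered >= coverage * total) {
                break;
            }
            selected.add(node.getKey());
            covered += node.getValue();
        }
        return selected;
    }
}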

I obtained some very good feedback after demonstrating DescribeX to several of the attendees. Some of the feedback included displaying cardinalities as well as displaying the information retrieval component while using summaries. It would have been nice to show how the scoring of a document would have been affected if some of the summary nodes were refined using an AxPRE to combine elements containing the search term. Next time I hope to allow the user to use the plugin to prod the product – “It’s like walking the high wire without a safety net,” as Guy Lohman put it.

Future work involves preparing a downloadable plugin for interested users. As it stands, the three versions can be made available and can work alongside each other (in fact, the third version requires the first); however, the instructions for use have not been updated in a while (though the application is easy to use). The newer version also lacks extensibility, since I would like to update the way in which the extension points for filters and coverage are implemented.

Overview Screenshot of DescribeX Demonstration at ICDE 2008 Cancun

The value of the semantic web. RDF$? November 6, 2007

Posted by shahan in information retrieval, internet architecture, online social networks, openid, semantic web, standards, Uncategorized.

The question that this entry seeks to answer is, “Using the semantic web, what resources are available that have meaningful marketable value?”.

While the value of the semantic web has been touted, marketable value is not as widely discussed. However, in order to encourage Google to develop an OpenRDF API, they need to see what it can do for them. In my previous post about Search Standards, I mentioned that measuring a person’s search preferences, such as the type of content to search and metric ranges, is key to improving results. Combining Greg Wilson’s post about Measurement with the value-of-data issues mentioned in Bob Warfield’s User-Contributed Data Auditing, we now want to understand how to retrieve semantically marked-up content that has the ability to generate revenue.

User-generated semantic metrics are easily achieved with the semantic web. Further, semantic metrics can be tied together using various means, one of which is mentioned in Dan Connolly’s blog entry Units of measure and property chaining. It should be noted that, due to the extensibility of semantic data, the value metrics are independent of any specifics, allowing them to be used for trust metrics as well.

Here is a general use case that describes what I mean (a small sketch follows the list):

  1. Content is made available. The quality is not called into question, yet.
  2. The content is semantically marked up so that it has properties that mean something.
  3. Other users mark up the content even further, with personally relevant properties that they create themselves or take from an existing schema (e.g., one available from their employer); these properties can be associated with their online identity via OpenID and extended to their social network through Google’s OpenSocial API.
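As a rough sketch of steps 2 and 3, here is what such markup could look like with the Apache Jena RDF API. The metrics namespace and property names are invented for illustration; only the OpenID URL is real.

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class UserMetricMarkupSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Hypothetical namespace for user-generated value metrics.
        String ns = "http://example.org/metrics#";
        Property ratedBy = model.createProperty(ns, "ratedBy");
        Property usefulness = model.createProperty(ns, "usefulness");

        // Step 2: the content is given properties that mean something.
        Resource content = model.createResource("http://example.org/articles/42");

        // Step 3: another user marks the content up with a personally relevant
        // metric, tied to their online identity (an OpenID URL).
        content.addProperty(ratedBy, model.createResource("http://vannevarvision.wordpress.com/"))
               .addProperty(usefulness, model.createTypedLiteral(0.8));

        model.write(System.out, "N-TRIPLE");
    }
}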

The data has now been extended from being searchable for relevant content using existing methods to becoming searchable using user-generated value metrics. These can then be leveraged, similar to Google Coop, and with further benefit if search standards were available.

If a group were selected based on their ability to identify and rank relevant content not by the content itself, but by the value associated with the properties of that content, then the question of relevance is no longer whether the content is relevant to the person evaluating it, but whether its properties would be relevant to someone searching for those properties. This potentially removes bias from relevance evaluation. Content is no longer evaluated for what it is but for what it is perceived as, and the metrics from paid users, as well as from users who view the content for their own or standard metrics, are easily expandable and searchable by others – an architecture permitting growth beyond limited views.

Want to comment on Tim Berners-Lee’s blog? Here’s how November 2, 2007

Posted by shahan in openid, semantic web.

It’s very easy. The Decentralized Information Group (DIG) is where you can find a bit of information on what’s being rolled out regarding the combined use of RDF and OpenID, and it is also host to several blogs. In order to comment, wise techniques have been implemented to block spammers through the use of OpenID, RDF, and a basic trust metric. Before someone can log in to post, the person must be placed on a whitelist. You cannot create an account on the site; OpenID is used to log in. To compute the basic trust metric of being known within 2 degrees of separation (a person at DIG knows someone who knows someone), you require a FOAF file. The following is a list of steps I took to get whitelisted:

1. WordPress provides an OpenID URL for me: it’s the address of my blog, http://vannevarvision.wordpress.com

2. I generated a FOAF file through the FOAF-a-matic.

3. I copied and pasted the generated RDF from step 2 into a text file called foaf.rdf, and added the line

<foaf:openid rdf:resource="http://vannevarvision.wordpress.com/"/>

before the line

</foaf:Person>

NOTE: this requirement may be removed in the future, in favour of using the homepage property instead of the openid property.
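For context, a stripped-down foaf.rdf along these lines might look like the following. The name and homepage are placeholders, and a real FOAF-a-matic file contains more detail.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>A. Blogger</foaf:name>
    <foaf:homepage rdf:resource="http://example.org/"/>
    <foaf:openid rdf:resource="http://vannevarvision.wordpress.com/"/>
  </foaf:Person>
</rdf:RDF>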

4. I saved the file, uploaded it to my homepage, and, to ensure that the Apache web server would serve the correct content type for the RDF file, added the following line to my .htaccess file:

AddType application/rdf+xml rdf

5. I joined the Semantic Web Interest Group’s IRC channel, where I asked whether anyone would be kind enough to add me to their ‘knows’ list in their own FOAF properties.

6. Sean B. Palmer (sbp) and Dan Connolly (DanC) were kind enough to look at my blog and see that I don’t have spammer intentions, so Sean added me to his FOAF file, validated it, then reran the script on the blog server to add me to the whitelist.

7. I’m now able to log in to the DIG site using my OpenID URL.

It was a very easy and quick process, though I had the advantage of a blog dating from last year with a few posts on XML and microformats, not entirely out of scope for the semantic web community. Thanks to sbp and DanC for their help.

Recommended References:

FOAF and OpenID: two great tastes that taste great together by Dan Connolly

Whitelisting blog post by Sean B. Palmer

XML Structural Summaries and Microformats October 31, 2007

Posted by shahan in eclipse plugin, information retrieval, search engines, software architecture, software development, visualization, XML.

In my experience attempting to integrate microformats into XML structural summaries, the results have all been workarounds.

Microformats are integrated into an XHTML page through the ‘class’ attribute of an element. I won’t go into the issues with doing this; while the additional information embedded in the page is welcome, it doesn’t conform to the standardized integration model offered by XML. A good reference on integrating and pulling microformat information from a page is here.

Microformats are not easily retrieved from a page because there is no way to know ahead of time which formats are integrated into it. A workaround for creating an XML structural summary based on microformats is to extend the XML element model to also index attributes and, furthermore, their values (in order to distinguish differing attributes). Since the structural summaries being developed using AxPREs are based on XPath expressions, they will be able to handle microformats, but with advance planning by the user.

The screenshot below is of DescribeX with a P* summary of a collection of hCalendar files. Using Apache Lucene, the files are indexed to include regular text tokens, XML elements, and XML attributes with their associated values. On the right-hand side you can see that a query has been entered using Lucene’s wildcard syntax, ‘*event*’, to search for ‘class’ attributes that contain that term. The vertices in red represent the elements that contain it, and while it would be nice to assume that the descendants of the highlighted vertices are related to hCalendar events, that is not the case.

Microformat highlighting using DescribeX
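For the curious, the indexing scheme described above would look roughly like the following with a current Apache Lucene API (the original used a much older Lucene version, and the field names here are illustrative only): one Lucene document per element occurrence, with the element name, the attribute name/value pairs, and the text content stored as separate fields, so that a query such as class:*event* can find microformat markers like hCalendar’s “vevent”.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MicroformatIndexerSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // One document per XML element occurrence (hypothetical field layout).
            Document doc = new Document();
            doc.add(new StringField("element", "span", Field.Store.YES));
            // Attribute values indexed under the attribute name, so queries over
            // the 'class' field can match microformat class names such as "vevent".
            doc.add(new StringField("class", "vevent", Field.Store.YES));
            doc.add(new TextField("text", "Project demo, ICDE 2008, Cancun", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}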

Search Standards and OpenID; not only for single sign-on, will search standards emerge? October 31, 2007

Posted by shahan in online social networks, search engines, software architecture, standards.

OpenID can be the answer to a whole slew of online profile questions. Not only can it answer, “how can I sign on to all these sites using my existing profile?”, it offers the possibility of answering, “How can I search this website using my existing preferences?”.
OpenID is a single sign-on architecture created by Janrain which enables users to use an existing account supporting OpenID to access other websites that also support it, thereby removing the need to create a separate account on each site. It is a secure method for passing account details from one site to another, and it differs from a password manager (either software or online) that hosts your different usernames and passwords for each site. By allowing your profile to be stored and represented online, it lets you use your existing information quickly and easily.

Despite Stefan Brands’ in-depth analysis of the problems that may arise with OpenID, OpenID is a good solution, not only because of the ease of authentication, but also because it’s a secure way of storing a profile online. WordPress has OpenID by default (more info here). With the number of search engines emerging that do different things with different methods, I predict the rise of search standards and profiles.

A simple definition of a Search Standard: the method and the properties that enable a user to search content.

These can cover search-engine-relevant properties (which can be translated into accepted user preferences) like the following (a rough sketch follows the list):

  • sources, e.g., blogs, news, static webpages
  • metric ranges, e.g., > 80% precision or recall
  • content creation date
  • last indexed or updated
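To make this concrete, here is a hypothetical sketch of what such a preference profile could look like as a small data structure that an OpenID-hosted profile might carry; the field names are mine and not part of any existing standard.

import java.time.LocalDate;
import java.util.List;

// Hypothetical search-preference profile; none of these field names come from a real spec.
public record SearchPreferences(
        List<String> sources,      // e.g. "blogs", "news", "static webpages"
        double minPrecision,       // e.g. 0.80
        double minRecall,          // e.g. 0.80
        LocalDate createdAfter,    // only content created after this date
        LocalDate indexedAfter) {  // only content (re)indexed or updated after this date

    public static SearchPreferences example() {
        return new SearchPreferences(
                List.of("blogs", "news"),
                0.80, 0.80,
                LocalDate.of(2007, 1, 1),
                LocalDate.of(2007, 10, 1));
    }
}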

This only opens the door to many areas in search engines and associated user preferences. Having these standards shifts the role of the search engine from dealing with the interface and presentation to the user to that of a web service (an actual engine) which can be exploited by combining it with other search engines. Having these preferences addresses one of the biggest concerns when dealing with users: understanding and identifying what they prefer. As the number of search engines increases, the search engine market will no longer be as horizontal as it has been, but will become more hierarchical as each engine specializes in its niche. Combinations of search parameters may prove beneficial as the amount and type of content increases, further encouraging the divergent expression of users on the web.

Alternative Search Engines October 13, 2007

Posted by shahan in Uncategorized.

In response to WebWorkerDaily’s article, none of the search engines listed include retrieval using structured information. Although I’m involved with information retrieval as part of my research, I don’t spend a lot of time exploring the search engines “out there”. The only reason I can give is that they haven’t done for me what Google already does with a little bit of query creativity. While searching news or blogs may have the benefit of limited scope, there’s no demonstration of added benefit.

A consequence of limiting search to a niche is that the popular terms within that niche become “boosted” automatically without being subsumed, e.g., by a larger news service or a certain wiki. Another is that the rate of re-crawling already indexed pages can be better managed. I’ll make it a point to explore whether these search engines examine markup on the page when crawling, though this is unlikely.

Currently my research efforts in information retrieval are over semi-structured document collections. Within our group we have been experimenting with boosts to certain structural elements, and although our efforts have yielded only slight improvements in the result rankings, there are a number of other tests to be run which I anticipate will reveal better boosting factors. The boosts we have experimented with thus far exclude subelement content lengths and are calculated as: sum, log(sum), 1/log(sum), avg, and no boosting. The boosting is based on a Markov chain model developed for Structural Relevance by Sadek Ali and shows great promise in using summaries.
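For the record, the boost variants listed above can be sketched as follows, where each subelement contributes a single value; exactly what is summed is simplified here, and this is not our actual experimental code.

import java.util.List;

public class ElementBoostSketch {
    public enum Variant { SUM, LOG_SUM, INVERSE_LOG_SUM, AVG, NONE }

    // Computes a boost factor for a structural element from the values of its subelements.
    public static double boost(List<Double> subelementValues, Variant variant) {
        double sum = subelementValues.stream().mapToDouble(Double::doubleValue).sum();
        switch (variant) {
            case SUM:             return sum;
            case LOG_SUM:         return Math.log(sum);
            case INVERSE_LOG_SUM: return 1.0 / Math.log(sum);
            case AVG:             return sum / subelementValues.size();
            case NONE:
            default:              return 1.0;
        }
    }
}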

Improving Blog Traffic October 11, 2007

Posted by shahan in Uncategorized.

As a relatively new blogger, I’ve often wondered how I want to portray my writings, and I have begun to make this a higher priority over the last few weeks. One of the best things about blogging is that it is a way to hold myself accountable publicly. I’m listing a few questions and their answers about what I see VannevarVision to be.

What am I blogging about?

internet, information retrieval, online social networks, some eclipse programming

Who is my audience?

researchers or those interested in the more technical details of the topics listed

Do I want readers to keep coming back?

of course, I think I have interesting things to say

What is my target post rate?

currently at least once a week; I will work this up to once a day.

Most Importantly… What is my motivation?

I have a voice, I have a pretty good idea of what I’m talking about, I will make a change somewhere that will affect readers like you. I have valuable experiences to draw from and I’d like to be remembered amongst the archives 100 years down the road when someone is digging through trying to piece my biography together to determine what kind of foods I ate, not to mention how many beers I drank. It’d be nice in the future for my kids when they’re looking through the old-school internet and see that I was serious about my work.

Why Now?

nothing like the present; I don’t need my forebrain smacked in the form of a wake-up call
