Note: This entry is also connected to a mindmap. Some people had problems opening the page because of it, so the mindmap has been moved to a separate page and can be viewed from here.
As Jeffrey Zeldman correctly pointed out, tag clouds are becoming more and more popular. Yet I keep seeing services that should be using tag clouds but keep on using tag sets. The problem is not just tools that can only support tag sets, but also tools that might in principle produce tag clouds, yet do not invite users to reuse a tag that already exists, and as such never generate a tag cloud.
Examples of the first type of tool are Flickr, 43things, consuMating and tagsurf*; an example of the second is the tagged version of the BBC*. In all those cases a tag set is used where a tag cloud would be more appropriate. Some of the differences between a tag cloud and a tag set were explained in Vanderwal.net: Explaining and Showing Broad and Narrow Folksonomies. Let’s go over them again, together with some of their consequences, which should clarify when it is better to use one tool and when it is better to use the other.
Tag Set vs. Tag Cloud
Tag clouds and tag sets are different kinds of objects.
By a tag set, I mean a set of tags, with no order whatsoever. Either a tag is part of the set, or it is not. The tags that one user uses to bookmark a single URL in del.icio.us are a tag set (or a tagset).
A tag cloud (or tagcloud) is a multiset of tags. That is, a set of tags where each tag can appear with multiplicity higher than one.
The fact that the relative number of times a tag appears follows a precise mathematical law, namely a power law, was suggested by various people before being measured and the process clearly explained.
So, for example, as of today 842 people have bookmarked the delicious popular page, each with their own tag set. The resulting tag cloud has 26 different tags with a multiplicity of at least 3 (which are the ones automatically offered by the delicious URL page).
- delicious: 401,
- del.icio.us: 152,
- popular, daily: 68,
- links: 43,
- web: 22,
- bookmarks: 20,
- news: 14,
- cool: 13,
- blogs: 9,
- internet: 8,
- blog: 7,
- meta: 6,
- folksonomy, memes, rss, technology, zeitgeist, tags: 5,
- bookmark, reference, tools, search, community: 4,
- geek, firefox: 3
- …: 2
- … …: 1
And the tags used one or two times are not automatically offered by del.icio.us, but can easily be generated too.
Now this means that effectively we have found a complicated function $d$ that, given a URL (actually a URI), gives you back a multiset $M$ of tags. This function is calculated when many people use their intelligence and time to categorise the URL from their unique point of view. The result of the function changes as time goes by: if today there are 401 people who have tagged the delicious popular page, you can be sure that next week there will be a different (generally greater) number of people.
So we have a function $d$ such that:
$$d(\text{URL}) = M$$
with $M$ being a multiset.
When you add many tag sets, you generate a tag cloud. And when you add many tag clouds, you generate another tag cloud.
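The addition described above can be sketched with Python's `collections.Counter`, which behaves like a multiset. This is only an illustration: the four users and their tags are hypothetical, and `tag_cloud` is a helper name made up for this example.

```python
from collections import Counter

def tag_cloud(tag_sets):
    """Sum tag sets into a multiset: each user counts a tag at most once."""
    cloud = Counter()
    for tags in tag_sets:
        cloud.update(set(tags))
    return cloud

# Hypothetical tag sets from four users bookmarking the same URL.
users = [
    {"delicious", "popular", "links"},
    {"delicious", "daily"},
    {"del.icio.us", "links"},
    {"delicious", "links", "web"},
]

cloud_a = tag_cloud(users[:2])   # tag cloud from the first two users
cloud_b = tag_cloud(users[2:])   # tag cloud from the last two users
merged = cloud_a + cloud_b       # adding two tag clouds gives another tag cloud

# Collapsing the cloud back to a plain set throws the multiplicities away.
collapsed = set(merged)
```

Adding the two partial clouds gives the same multiset as summing all four tag sets at once, which is why the operation can be repeated as more people bookmark the page.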
First take away message:
A tag set is fundamentally different from a tag cloud. It is simpler. If you just list the tags that were used to tag a certain URL without their multiplicity, or do not permit the same tag to be used more than once (as in Flickr, 43things, consuMating, tagsurf), or do not let the user gain something from adding an already existing tag to a tag cloud (as in the tagged version of the BBC), you are actually collapsing the tag space and identifying objects that might be seriously different. Later I will show a way to find objects similar to a given one. This only works well with tag clouds, not with tag sets.
Consequences of seeing Tag Clouds as Power Law
Earlier I said that it has been suggested, measured and explained that tag clouds are power laws. I was lying. Tag clouds are not power laws. The limit of the process that generates a tag cloud will eventually produce a power law. Until that day a tag cloud is just a multiset which, with each person adding their information, approximates a power law better and better.
But what would it actually mean if a tag cloud were a power law?
If a tag cloud were a power law, we could express it as an ordered list of tags and a number: the number representing the steepness of the curve, and the list representing the tags, ordered from the most popular to the least popular.
The fact that tag clouds only approximate power laws means that if we try to express a tag cloud as a power law, we will be making an error, and although we might expect the error eventually to go to zero, at the beginning it might be quite massive. Also, global cultural changes, even small ones, might shift the target from one approximated power law to another, thus temporarily generating a bigger error until the change has been integrated into the curve.
For example, the paper Clay Shirky: Power Laws, Weblogs, and Inequality has by now been bookmarked by 113 people on delicious. When it came out, the term ‘long tail’ was not in use. Long tails were always present; they were just not culturally recognised as such.
In its October 2004 issue, Wired published the article The Long Tail. The article was an immediate hit, and on the same day that the first person bookmarked it, 21 other people bookmarked it too. The link appeared on delicious popular, and a huge number of people read it and bookmarked it. This article changed the way people looked at power laws, and thus it changed the way people perceived the earlier article by Clay Shirky. At the moment 8 people have tagged his paper as ‘longtail’, and 3 as ‘long_tail’. Today the only tags more common than ‘longtail’ are:
powerlaw, blogs, blog, blogging, web, network, socialsoftware, shirky.
‘networks’ was used the same number of times as ‘longtail’, and ‘economics’ follows closely.
The first person to bookmark this document was angusf on the 9th of February 2003.
The first person to use the tag ‘longtail’ for this document was manasgarg, on the 27th of January 2005: nearly 2 years after the document was first linked from delicious, and two and a half months after the Wired article was published. But what is more important: all the tags that are now more common than ‘longtail’ were already present at the time. Even ‘economics’ and ‘networks’ predate the tag ‘longtail’. Longtail was a newcomer. But it was not just a random variation or a product of the intrinsic noise. This term represented a fundamental change in the way we look at power laws (at least in this community). So the term started to be used more and more frequently, and rose in the list. And we can expect to see it rise even further.
As such, the actual tag cloud is just an approximation of the (power law) tag cloud that it would become if infinitely many people from our culture were to tag it. But if infinitely many people from the culture before the moment when longtail became popular were to bookmark this site, the tag cloud would still approximate a power law… just a different one. The Wired article somehow changed the culture, and the solution that the tag cloud was approximating changed. As the approximated solution changed, the error (the difference between the tag cloud at a certain point in time and the infinite tag cloud it is trying to approximate) suddenly grew, and then, eventually, started to decrease again. Not only did the error between the tag cloud and the approximated power law increase, but the error between the tag cloud and the nearest power law increased too, as having a new term starting its ascent made the whole tag cloud wiggle away from its powerlawness. (I know it sounds funny, but it’s actually true.)
Not only could we study a culture by studying the differences in the power laws approximated by the tag clouds used by people of that culture, but we could even measure a cultural earthquake by measuring the difference between the tag cloud generated before a certain event and the one generated after it. But I am digressing. (Addendum: more about this here)
Does the fact that a tag cloud is not a power law, but just an approximation, mean that we shouldn’t represent it as an ordered list of tags and its steepness? Of course not. On a perfect power law, you just need to know the ratio between the first two terms of the list to calculate the steepness. This might not actually be true in our approximation, and we might need to apply some slightly more advanced maths to find the steepness the tag cloud is approximating. That is probably the only practical difference this makes. We also need to remember that the list might change over time. But not that much.
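To make the two ways of finding the steepness concrete, here is a minimal sketch. It assumes the power law has the form count(rank) = C * rank^(-s), and it uses a plain least-squares line through the log-log points as the "slightly more advanced" method; that is one simple choice among several, not necessarily the best one. The counts are synthetic, and the function names are made up for this example.

```python
import math

def steepness_from_first_two(counts):
    """Steepness s from the ratio of the two largest counts.

    Exact only for a perfect power law count(rank) = C * rank**(-s).
    """
    return math.log(counts[0] / counts[1]) / math.log(2)

def steepness_least_squares(counts):
    """Steepness s from a straight-line fit of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance

# Synthetic counts: a perfect power law with steepness 1.5, and a noisy copy
# standing in for a tag cloud that only approximates it.
perfect = [1000 * rank ** -1.5 for rank in range(1, 11)]
noise = [1.0, 1.3, 0.9, 1.1, 0.8, 1.2, 1.0, 0.9, 1.1, 1.0]
noisy = [count * factor for count, factor in zip(perfect, noise)]

print(steepness_from_first_two(perfect), steepness_least_squares(perfect))
print(steepness_from_first_two(noisy), steepness_least_squares(noisy))
```

On the perfect data both estimates agree. On the noisy data the two-term estimate inherits all of the noise in the second tag, while the fit over every rank stays much closer to the steepness that was used to generate the data.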
Approximating the tag cloud with a power law
If a tag cloud $M$ is an approximation of a power law function $f$, then $f$ is itself an approximation of $M$. Thus we can approximate $M$ with $f$, and give the ordered list of tags and the steepness as the only data necessary to find the power law, and thus the tag cloud. TagSchema, in Web 2.0 needs data 2.0, pointed out that we might be rapidly approaching a bottleneck in the way we treat data, and that we might need to change the way we code data. Well, the idea of passing tag clouds as power laws might be part of the solution. For example, delicious could offer an API that, given any URL, returns the complete (!) ordered list of tags used and the relative steepness. This would then be enough for any search engine or Greasemonkey script to regenerate the whole information. Of course this will not tell us the absolute values of the tags. But this is irrelevant.
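As a rough sketch of what a client could do with such a response: the payload shape, its field names and the steepness value below are all hypothetical (no such delicious API is described here), and the power law is again assumed to have the form rank^(-s). The `decode` function rebuilds the relative share of each tag, not its absolute count.

```python
def decode(ordered_tags, steepness):
    """Rebuild relative tag shares from an ordered tag list and a steepness."""
    raw = [rank ** -steepness for rank in range(1, len(ordered_tags) + 1)]
    total = sum(raw)
    return {tag: value / total for tag, value in zip(ordered_tags, raw)}

# Hypothetical response for one URL; field names and steepness are made up.
payload = {
    "tags": ["delicious", "del.icio.us", "popular", "links"],
    "steepness": 1.4,
}

shares = decode(payload["tags"], payload["steepness"])
for tag, share in shares.items():
    print(f"{tag}: {share:.3f}")
```

The shares sum to one by construction, so the reconstruction carries the shape of the tag cloud while leaving out how many people bookmarked the page.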
Take away message:
Tag clouds can be approximated by power laws. As power laws can be encoded by an ordered list and a number, so (up to an error) can tag clouds.
Tag Clouds approximate a point in space
We have seen that tag clouds tend to approximate a power law. As a tag cloud grows, the absolute count of each tag will change, but the relative amount of each tag with respect to the others (and with respect to the number of people who have bookmarked the entry) will change much less. So if we consider the ratio between the number of people who have used a tag and the number of people who have bookmarked the page, this ratio will asymptotically converge to a certain number as more people bookmark the site.
Let’s put all this in a table:
Total number of people who have bookmarked the page: 852
The first column lists the tags, ordered from the most common to the least common; the second column gives the absolute number of times a tag has been used; and the third column gives the relative weight of the tag according to the formula:
- weight of tag t: (number of people using t) / (total number of people)
The absolute number will tend to grow linearly with time, but the weight of a given tag will asymptotically converge to a certain value as the total number of people goes to infinity. As such it will not fluctuate much.
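The weight formula is simple enough to sketch in a few lines. The counts below are the top five tags from the delicious popular page list given earlier in this post, taken when 842 people had bookmarked it, so they are an earlier snapshot than the 852 quoted for the table. `tag_weights` is a name made up for this example.

```python
def tag_weights(counts, total_people):
    """Weight of each tag: people using the tag / people who bookmarked the page."""
    return {tag: n / total_people for tag, n in counts.items()}

# Top counts from the earlier delicious popular page list (842 bookmarks).
counts = {"delicious": 401, "del.icio.us": 152, "popular": 68, "daily": 68, "links": 43}
weights = tag_weights(counts, total_people=842)

# Fixing an order for the tags turns the weights into coordinates of a point.
point = [weights[tag] for tag in sorted(weights)]
print(weights)
```

Each weight lies between 0 and 1, because nobody can use a tag more than once on the same bookmark.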
As such we can say that each tag cloud not only approximates a power law, but also approximates a point in the n-dimensional hypercube, $H^n$. If every tag cloud approximates a different position in $H^n$, then we can measure the distance between URLs by measuring the distance between those points. But this is the subject of the next section.
Take away message of this section:
Each tag cloud generates a point in the n-dimensional hypercube. As a tag cloud grows, the point might move, but the speed of change will decrease with the number of people who have bookmarked it. So eventually the position will only change by a negligible amount. Since the tag cloud approximates a power law, the whole information can be easily encoded as an ordered list and a number representing the steepness of the power law. If we use only tag sets (or software that forces or invites people to use just tag sets) we lose this information.
And now the final problem is: given a URL, how do I find other similar URLs?
But now the problem is trivial, as will be explained in the next section: Using power laws to find similar pages.
Two points for the mathematicians & UberGeeks
- Saying that the point is in $H^n$ is true, but still a simplification. Note how the standard simplex cuts the hypercube into two parts. The points will all be in the external part (farther from the origin), but all this is irrelevant at the present time. The proof is trivial.
- A different formula is also possible:
- weight of tag t: (number of people using t) / (total number of tags used by all the people)
- which is harder to calculate, but would let each person have the same voting power. If we use this second formula instead, the point will simply lie on the standard simplex. Much nicer!
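Both formulas can be compared on a small example. The three users and their tags are hypothetical. The sketch checks the two geometric remarks above: with the first formula the coordinates sum to at least one (as long as every person uses at least one tag), and with the second they sum to exactly one.

```python
from collections import Counter

# Hypothetical tag sets from three users bookmarking the same URL.
tag_sets = [
    {"delicious", "links"},
    {"delicious", "popular", "daily"},
    {"delicious"},
]

cloud = Counter(tag for tags in tag_sets for tag in tags)
people = len(tag_sets)
total_tags = sum(cloud.values())

# First formula: people using the tag / total number of people.
per_person = {tag: n / people for tag, n in cloud.items()}

# Second formula: people using the tag / total number of tags used by everyone.
per_tag = {tag: n / total_tags for tag, n in cloud.items()}

print(sum(per_person.values()), sum(per_tag.values()))
```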
Using power laws to find similar pages.
Finding a document that is similar to a given one can be trivially solved once we have a measure of the distance between two documents. Since each document uniquely defines a point in the hypercube, we can just define the distance between two documents as the distance between their positions in the hypercube. There are different kinds of metric that can be applied, but for now I will just suggest the Euclidean distance. So if Document 1 has been tagged with:
tag1, tag2, tag3, with relative weights (as defined in the previous section): $t_1, t_2, t_3$
and the second document has been tagged with: tag2, tag3, tag4, with weights: $d_2, d_3, d_4$
then the distance between the two will be:
$$\sqrt{t_1^2 + (t_2 - d_2)^2 + (t_3 - d_3)^2 + d_4^2}$$
And this is quite cool, because it means that given a document we can easily find its distance from any other document. And through this we can start to investigate the long tail of the web. This is also not as difficult as it might sound. Since a document can be approximated both as a power law and as a position inside the hypercube, two documents can each be expressed as an ordered list of tags (plus the steepness, but we don’t need it now), and the first terms are the most relevant for the position in the hypercube. So the first terms of the list of one document generally need to be present in the list of the second document. Not necessarily in the same order… but that would help too.
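A minimal sketch of this distance, and of ranking other documents by it, could look as follows. A tag that a document does not carry is treated as having weight zero, which matches the tag1/tag4 case in the example above. The weights and the names `euclidean` and `nearest` are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two tag clouds given as {tag: weight} dicts."""
    tags = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in tags))

def nearest(target, candidates):
    """Names of the candidate documents, closest to the target first."""
    return sorted(candidates, key=lambda name: euclidean(target, candidates[name]))

# Hypothetical weights for three documents.
doc1 = {"tag1": 0.5, "tag2": 0.3, "tag3": 0.2}
doc2 = {"tag2": 0.4, "tag3": 0.2, "tag4": 0.1}
doc3 = {"tag1": 0.5, "tag2": 0.2, "tag3": 0.2}

ranked = nearest(doc1, {"doc2": doc2, "doc3": doc3})
print(ranked)
```

Here doc3 shares its heaviest tag with doc1 and so ends up much closer than doc2, which illustrates why the first terms of the ordered list matter most.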
One point for the mathematicians & UberGeeks
- Of course other types of distance are possible. The Euclidean metric is known as L2, as every term is raised to the second power and then the square root of the sum is taken. I suppose that different types of metric (L1, Lmax, …) will give different vicinities, and there is quite a lot to explore just by playing with various metrics and seeing what appears to be near what in each of them. I don’t exclude that people using tag sets can be seen as using a particular type of metric.
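For anyone who wants to play with the other metrics, here is a small sketch of the general family, with L1, L2 and Lmax as special cases. The function name and the sample weights are hypothetical, and missing tags are again treated as weight zero.

```python
import math

def distance(a, b, p=2):
    """Lp distance between two {tag: weight} dicts; p=math.inf gives Lmax."""
    diffs = [abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b)]
    if p == math.inf:
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1 / p)

# Hypothetical weights for two documents.
doc1 = {"tag1": 0.5, "tag2": 0.3, "tag3": 0.2}
doc2 = {"tag2": 0.4, "tag3": 0.2, "tag4": 0.1}

for p in (1, 2, math.inf):
    print(p, distance(doc1, doc2, p))
```

L1 adds up every difference, while Lmax only looks at the single largest one, so the same pair of documents can look near under one metric and far under another.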
Tag sets and tag clouds are different types of objects. Tag clouds can be approximated by a power law, and vice versa, a power law can be approximated by a tag cloud. A power law can be fully calculated once the ordered set of elements is given and the steepness is specified. Thus giving the ordered list of tags in a tag cloud and the steepness is (up to approximation errors) enough to have a general understanding of the tag cloud. Each tag cloud can also be expressed as a point in the n-dimensional hypercube. As a tag cloud grows it will tend to asymptotically approach a point in the hypercube, while the multiplicity of its tags will tend to approach a power law distribution. Thus when we pass the information about the power law distribution, we are implicitly also passing the information about the position of the tag cloud as a point in the n-dimensional hypercube. Since we can easily define a metric inside the n-dimensional hypercube, we also have a metric among tag clouds. Given a tag cloud, it is then just a matter of calculation to find its neighboring ones. None of this is possible using tag sets.
Not a standard bibliography, but most of the articles that inspired me to write this piece. This is just a short list, as the ideas written here have been inspired by many, many other articles found on the net.
Some key documents
Clay Shirky: Ontology is Overrated: Categories, Links, and Tags
Social Bookmarking Tools (I): A General Review
Vanderwal.net: Explaining and Showing Broad and Narrow Folksonomies
Clay Shirky: Power Laws, Weblogs, and Inequality
On how tags follow a power law:
Blog Entries on Tag Clouds
Example of a blog that uses Tag Clouds
This work seems to have inspired some extra work.