On Tag Clouds, Metric, Tag Sets and Power Laws

Note: This entry is also connected to a mindmap. Some people had problems opening the page because of it, so the mindmap has been moved to a separate page and can be viewed from here.

Introduction

As Jeffrey Zeldman correctly pointed out, tag clouds are becoming more and more popular. Yet I keep seeing services that should be using tag clouds but keep using tag sets. The problem is not just tools that can only support tag sets, but also tools that might in principle produce tag clouds, yet give users no reason to add a tag that already exists, and so never generate a tag cloud.

Examples of the first type of tool are Flickr, 43things, consuMating and tagsurf; an example of the second is the tagged version of the BBC. In all those cases a tag set is used where a tag cloud would be more appropriate. Some of the differences between a tag cloud and a tag set were explained in Vanderwal.net: Explaining and Showing Broad and Narrow Folksonomies. Let’s go through them again and look at some of their consequences, which should clarify when it is better to use one tool and when the other.

Tag Set vs. Tag Cloud

Tag clouds and tag sets are different kinds of objects.

By tag set I mean a set of tags, with no order whatsoever. Either a tag is part of the set, or it is not. The tags that one user uses to bookmark a single URL in del.icio.us form a tag set (or tagset).

A tag cloud (or tagcloud) is a multiset of tags. That is, a set of tags where each tag can appear with multiplicity higher than one.

That the relative number of times a tag appears follows a precise mathematical law, namely a power law, was suggested by various people before it was measured and the process clearly explained.

So, for example, as of today 842 people have bookmarked the delicious popular page, each with their own tag set. The resulting tag cloud has 26 different tags with a multiplicity of at least 3 (the ones automatically offered by the delicious URL page).

So the delicious popular page has so far been tagged with:

  • delicious: 401,
  • del.icio.us: 152,
  • popular, daily: 68,
  • links: 43,
  • web: 22,
  • bookmarks: 20,
  • news: 14,
  • cool: 13,
  • blogs: 9,
  • internet: 8,
  • blog: 7,
  • meta: 6,
  • folksonomy, memes, rss, technology, zeitgeist, tags: 5,
  • bookmark, reference, tools, search, community: 4,
  • geek, firefox: 3
  • …: 2
  • … …: 1

The tags used only once or twice are not automatically offered by del.icio.us, but can easily be obtained too.

This means that we have effectively found a complicated function d that, given a URL (actually a URI), gives back a multiset M of tags. This function is computed by many people using their intelligence and time to categorise the URL from their own unique point of view. The result of the function changes as time goes by: if today 401 people have tagged the delicious popular page as ‘delicious’, you can be sure that next week the number will be different (generally greater).

So we have a function ‘d’ such that:
d(URI)-> M
with ‘M’ being a multiset.

When you add many tag sets, you generate a tag cloud. And when you add many tag clouds, you generate another tag cloud.
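The addition of tag sets and tag clouds can be sketched in a few lines of Python. This is only an illustration: the tag sets and the second cloud below are made up, and `cloud_from_tag_sets` is a hypothetical helper name, not part of any del.icio.us tool.

```python
from collections import Counter

# Made-up tag sets, one per user, all for the same URL.
tag_sets = [
    {"delicious", "popular"},
    {"delicious", "links"},
    {"del.icio.us", "popular", "daily"},
]

def cloud_from_tag_sets(tag_sets):
    """Add tag sets together: each set contributes 1 to every tag it contains."""
    cloud = Counter()
    for tags in tag_sets:
        cloud.update(tags)
    return cloud

cloud_a = cloud_from_tag_sets(tag_sets)
cloud_b = Counter({"delicious": 5, "web": 2})  # another made-up tag cloud

# Adding two tag clouds (multisets) gives another tag cloud.
combined = cloud_a + cloud_b
```

A `Counter` is used here because it behaves like a multiset: the keys are the tags and the values are their multiplicities.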

First take away message:
A tag set is fundamentally different from a tag cloud. It is simpler. If you just list the tags that were used to tag a certain URL without their multiplicity, or do not permit the same tag to be used more than once (as in Flickr, 43things, consuMating, tagsurf), or do not let users gain something from adding an already existing tag to a tag cloud (as in the tagged version of the BBC), you are actually collapsing the tag space and identifying objects that might be seriously different. Later I will show a way to find objects similar to a given one. This only works well with tag clouds, not with tag sets.
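To make the collapse concrete, here is a small sketch with two invented tag clouds that share the same tags but with very different multiplicities. The numbers are illustrative only.

```python
from collections import Counter

# Two made-up tag clouds for two different URLs.
cloud_1 = Counter({"delicious": 401, "links": 43, "news": 14})
cloud_2 = Counter({"delicious": 14, "links": 43, "news": 401})

# Collapsing each multiset to a plain set throws the multiplicities away.
set_1 = set(cloud_1)
set_2 = set(cloud_2)

same_as_sets = set_1 == set_2        # the tag sets cannot tell the URLs apart
same_as_clouds = cloud_1 == cloud_2  # the tag clouds still can
```

As tag sets the two URLs are identical; as tag clouds one is mostly about ‘delicious’ and the other mostly about ‘news’.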

Consequences of seeing Tag Clouds as Power Law

Earlier I said that it has been suggested, measured and explained that tag clouds follow a power law. I was lying. Tag clouds are not power laws. The limit of the process that generates tag clouds will eventually produce a power law. Until that day a tag cloud is just a multiset which, with each person adding their information, approximates a power law better and better.

But what would it actually mean if a tag cloud were a power law?
If a tag cloud were a power law, we could express it as an ordered list of tags and a number: the number representing the steepness of the curve, and the list representing the tags, ordered from the most popular to the least popular.

The fact that tag clouds only approximate power laws means that if we express a tag cloud as a power law we will be making an error; although we might expect the error eventually to go to zero, at the beginning it might be quite large. Also, global cultural changes, even small ones, might shift the target from one approximated power law to another, temporarily generating a bigger error until the change has been integrated into the curve.

For example the paper Clay Shirky: Power Laws, Weblogs, and Inequality has by now been bookmarked by 113 people on delicious. When it came out the term ‘long tail’ was not in use. Long tails were always present; they were just not culturally recognised as such.
In the October 2004 issue of Wired the article The Long Tail came out. The article was an immediate hit: on the same day the first person bookmarked it, 21 other people bookmarked it too. The link appeared on delicious popular, and a huge number of people read it and bookmarked it. This article changed the way people looked at power laws, and thus the way people perceived the earlier article by Clay Shirky. At the moment 8 people have tagged his paper as ‘longtail’, and 3 as ‘long_tail’. Today the only tags more common than ‘longtail’ are:
powerlaw, blogs, blog, blogging, web, network, socialsoftware, shirky.
‘networks’ has been used the same number of times as ‘longtail’, and ‘economics’ follows closely.
The first person to bookmark this document was angusf, on the 9th of February 2003.
The first person to use the term ‘longtail’ to tag this document was manasgarg, on the 27th of January 2005: nearly two years after the document was first bookmarked on delicious, and two and a half months after the Wired article was published. More importantly, all the terms that are now more common than ‘longtail’ were already present at the time. Even ‘economics’ and ‘networks’ predate the term ‘longtail’.

Longtail was a newcomer. But it was not just a random variation or a product of the intrinsic noise. The term represented a fundamental change in the way we look at power laws (at least in this community). So the term started to be used more and more frequently, and rose in the list. And we can expect to see it rise even further.

As such the actual tag cloud is just an approximation of the (power law) tag cloud that would result if infinitely many people from our culture were to tag the document. But if infinitely many people from the culture before ‘longtail’ became popular were to bookmark it, the tag cloud would still approximate a power law… just a different one. The Wired article somehow changed the culture, and the solution that the tag cloud was approximating changed. As the approximated solution changed, the error (the difference between the tag cloud at a certain point in time and the infinite tag cloud it is trying to approximate) suddenly grew. And then, eventually, it started to decrease again. Not only did the error between the tag cloud and the approximated power law increase, but so did the error between the tag cloud and the nearest power law, as a new term starting its ascent made the whole tag cloud wiggle away from its powerlawness. (I know it sounds funny, but it’s actually true.)

Not only could we study a culture by studying the differences in the power laws approximated by the tag clouds used by people of that culture; we could even measure cultural earthquakes by measuring the difference between the tag cloud generated before a certain event and the one generated after it. But I am digressing. (Addendum: more about this here)

Does the fact that a tag cloud is not a power law, but just an approximation of one, mean that we shouldn’t represent it as an ordered list of tags plus its steepness? Of course not. On a perfect power law, you just need the ratio between the first two terms of the list to calculate the steepness. This might not hold for our approximation, and we might need to apply some slightly more advanced maths to find the steepness the tag cloud is approximating. That is probably the only practical difference this makes. And we need to remember that the list might actually change over time. But not that much.
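Both ways of estimating the steepness can be sketched as follows. The sketch assumes the power law has the form count(r) = C · r^(−a), where r is the rank of the tag and a is the steepness; the function names are hypothetical, and the "slightly more advanced" method shown is a plain least-squares line through the log-log points, which is only one possible (and statistically rather crude) choice. The counts are the top of the delicious popular list, assuming that tags listed together with one number were each used that many times.

```python
import math

# Multiplicities from the delicious popular example, most popular first.
# Assumption: 'popular' and 'daily' were each used 68 times.
counts = [401, 152, 68, 68, 43, 22, 20, 14, 13, 9, 8, 7, 6]

def steepness_from_first_two(counts):
    """Exact on a perfect power law count(r) = C * r**(-a): a = log2(c1 / c2)."""
    return math.log2(counts[0] / counts[1])

def steepness_least_squares(counts):
    """Negated slope of the least-squares line through (log rank, log count)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

a_two_terms = steepness_from_first_two(counts)  # uses only 401 and 152
a_fit = steepness_least_squares(counts)         # uses every listed count
```

On a perfect power law the two functions agree; on a real tag cloud they generally will not, and the gap between them gives a rough idea of how far the cloud still is from its limit.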

Approximating the tag cloud with a power law

If a tag cloud M is an approximation of a power law function f, then f is itself an approximation of M. Thus we can approximate M with f, and give the ordered list of tags and the steepness as the only data necessary to recover the power law, and thus the tag cloud. TagSchema, in Web 2.0 needs Data 2.0, pointed out that we might be rapidly approaching a bottleneck in the way we treat data, and that we might need to change the way we code data. Well, the idea of passing tag clouds as power laws might be part of the solution. For example delicious could offer an API that, given any URL, returns the complete (!) ordered list of tags used and the relative steepness. This would then be enough for any search engine or Greasemonkey script to regenerate the whole information. Of course this will not tell us the absolute values of the tags. But this is irrelevant.
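A rough sketch of what passing a tag cloud as "ordered list plus steepness" could look like. Everything here is hypothetical: `encode` and `decode` are illustrative names, not a real delicious API, the steepness value 1.4 is simply assumed, and the power law is taken to be weight(r) proportional to r^(−a) for the tag at rank r.

```python
def encode(cloud, steepness):
    """Keep only the tags, ordered from most to least used, plus the steepness."""
    ordered = sorted(cloud, key=cloud.get, reverse=True)
    return ordered, steepness

def decode(ordered_tags, steepness):
    """Rebuild relative weights: rank r gets r**(-steepness), scaled to sum to 1."""
    raw = [rank ** -steepness for rank in range(1, len(ordered_tags) + 1)]
    total = sum(raw)
    return {tag: value / total for tag, value in zip(ordered_tags, raw)}

# A few counts from the delicious popular example.
cloud = {"delicious": 401, "del.icio.us": 152, "popular": 68, "links": 43}

ordered, a = encode(cloud, steepness=1.4)  # 1.4 is an assumed steepness
weights = decode(ordered, a)               # relative weights only, no absolute counts
```

The decoded weights are only relative, which matches the remark that the absolute values are lost but do not matter.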

Take away message:
Tag clouds can be approximated with power laws. As power laws can be coded by an ordered list and a number, so (up to an error) can tag clouds.

Tag Clouds approximate a point in space

We have seen that tag clouds tend to approximate a power law. As a tag cloud grows, the absolute count of each tag will change, but the relative amount of each tag with respect to the others (and with respect to the number of people who have bookmarked the entry) will change much less. So if we consider the ratio of the number of people who have used a tag to the number of people who have bookmarked the page, this ratio will asymptotically converge to a certain value as more people bookmark the page.

Let’s put all this in a table:

Total number of people who have bookmarked the page: 852

Tag                                                  Times used  Relative weight
delicious                                            401         0.470657277
del.icio.us                                          152         0.178403756
popular, daily                                       68          0.079812207
links                                                43          0.050469484
web                                                  22          0.025821596
bookmarks                                            20          0.023474178
news                                                 14          0.016431925
cool                                                 13          0.015258216
blogs                                                9           0.010563380
internet                                             8           0.009389671
blog                                                 7           0.008215962
meta                                                 6           0.007042254
folksonomy, memes, rss, technology, zeitgeist, tags  5           0.005868545
bookmark, reference, tools, search, community        4           0.004694836
geek, firefox                                        3           0.003521127
…                                                    2           0.002347418
… …                                                  1           0.001173709

The first column lists the tags, ordered from the most common to the least common. The second column gives the absolute number of times a tag has been used, and the third column the relative weight of the tag according to the formula:

  • weight of tag t: # people using t / total # of people

The absolute number will tend to grow linearly with time, but the weight of a certain tag will asymptotically converge to a certain value, as the total number of people goes to infinity. As such it will not fluctuate much.
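The weight formula is simple enough to check by hand, but here is a small sketch that reproduces a few of the relative weights from the counts and the total of 852 people. The function name `relative_weights` is just an illustrative choice.

```python
# Total number of people who bookmarked the page, and a few of the tag counts.
total_people = 852
counts = {
    "delicious": 401,
    "del.icio.us": 152,
    "popular": 68,
    "links": 43,
    "web": 22,
}

def relative_weights(counts, total_people):
    """weight of tag t = (# people using t) / (total # of people)."""
    return {tag: n / total_people for tag, n in counts.items()}

weights = relative_weights(counts, total_people)
```

Each weight lies between 0 and 1, since no tag can be used by more people than have bookmarked the page.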

As such we can say that each tag cloud not only approximates a power law, but also approximates a point in the n-dimensional hypercube, Hn. If every tag cloud approximates a different position in Hn, then we can measure the distance between URLs by measuring the distance between those points. But this is the subject of the next section.

Take away message of this section:
Each tag cloud generates a point in the n-dimensional hypercube. As a tag cloud grows, the point might move, but the speed of change will decrease with the number of people who have bookmarked it, so eventually the position will change only by a negligible amount. Since the tag cloud approximates a power law, the whole information can be easily encoded as an ordered list and a number representing the steepness of the power law. If we use only tag sets (or software that forces or invites people to use only tag sets) we lose this information.

And now the final problem is: given a URL, how do I find other similar URLs?
But now the problem is trivial, as will be explained in the next section: Using power laws to find similar pages.

Two points for the mathematicians & UberGeeks

  • Saying that the point is in Hn is true, but still a simplification. Note how the standard simplex (the points whose coordinates sum to one) cuts the hypercube into two parts. The points will all be in the external part (farther from the origin), provided each person uses at least one tag, but all this is irrelevant at the present time. The proof is trivial.
  • a different formula is also possible:
    • weight of tag t: # people using t / total # of tags used by all the people
  • which is harder to calculate, but would let each person have the same voting power. If we use this second formula instead, the point will simply lie on the standard simplex. Much nicer!
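The difference between the two formulas can be checked on a toy example. The three tag sets below are invented, and the variable names are only illustrative; the sketch assumes every person uses at least one tag.

```python
from collections import Counter

# Made-up tag sets from three people bookmarking the same URL.
tag_sets = [
    {"delicious", "popular"},
    {"delicious"},
    {"links", "popular", "daily"},
]

counts = Counter()
for tags in tag_sets:
    counts.update(tags)

people = len(tag_sets)           # denominator of the first formula
tag_uses = sum(counts.values())  # denominator of the second formula

per_person = {tag: n / people for tag, n in counts.items()}
per_tag_use = {tag: n / tag_uses for tag, n in counts.items()}

sum_per_person = sum(per_person.values())    # at least 1: outside the simplex
sum_per_tag_use = sum(per_tag_use.values())  # 1 up to rounding: on the simplex
```

With the first formula the coordinates sum to the average number of tags per person, which is why the point lands on or beyond the simplex; with the second they always sum to one.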

Using power laws to find similar pages.

Finding a document that is similar to a given one is trivial once we have a measure of the distance between two documents. Since each document defines a point in the hypercube, we can define the distance between two documents as the distance between their positions in the hypercube. Different kinds of metric can be applied, but for now I will just suggest the Euclidean distance. So if Document 1 has been tagged with:
tag1, tag2, tag3, with relative weights (as defined in the previous section): t1, t2, t3
and Document 2 has been tagged with: tag2, tag3, tag4, with weights: d2, d3, d4
then the distance between the two will be:

distance = sqrt( t1^2 + (t2 - d2)^2 + (t3 - d3)^2 + d4^2 )

where a tag that a document does not use counts as weight zero for that document.
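A minimal sketch of the Euclidean distance between two tagged documents, assuming each document is given as a mapping from tag to relative weight and that a tag missing from a document has weight zero. The weights below are invented for illustration.

```python
import math

def euclidean_distance(weights_a, weights_b):
    """Euclidean distance between two tag clouds given as {tag: relative weight}."""
    tags = set(weights_a) | set(weights_b)
    return math.sqrt(sum(
        (weights_a.get(tag, 0.0) - weights_b.get(tag, 0.0)) ** 2
        for tag in tags
    ))

# Made-up weights: Document 1 uses tag1..tag3, Document 2 uses tag2..tag4.
doc1 = {"tag1": 0.5, "tag2": 0.3, "tag3": 0.2}
doc2 = {"tag2": 0.3, "tag3": 0.4, "tag4": 0.1}

d = euclidean_distance(doc1, doc2)
```

Only the tags used by at least one of the two documents enter the sum, so the number of dimensions n never has to be fixed in advance.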

And this is quite cool, because it means that given a document we can easily find its distance from any other document. Through this we can start to investigate the long tail of the web. This is also not as difficult as it might sound. Since a document can be approximated both as a power law and as a position inside the hypercube, two documents can each be expressed as an ordered list of tags (plus the steepness, but now we don’t need it), and the first terms of the list matter most for the position in the hypercube. So for two documents to be close, the first terms in the list of one document generally need to be present in the list of the other. Not necessarily in the same order… but that would help too.
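Finding the pages nearest to a given one then reduces to sorting by distance. The sketch below is a brute-force version over a tiny invented collection; the URLs, the weights and the helper names `distance` and `nearest` are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two {tag: weight} mappings (missing tag = 0)."""
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))

def nearest(url, clouds, k=2):
    """Return the k URLs whose weight vectors lie closest to that of `url`."""
    others = [u for u in clouds if u != url]
    return sorted(others, key=lambda u: distance(clouds[url], clouds[u]))[:k]

# Made-up relative weights for three made-up URLs.
clouds = {
    "http://example.org/a": {"powerlaw": 0.6, "blogs": 0.3, "longtail": 0.1},
    "http://example.org/b": {"powerlaw": 0.5, "longtail": 0.4, "economics": 0.1},
    "http://example.org/c": {"recipes": 0.7, "food": 0.3},
}

closest = nearest("http://example.org/a", clouds, k=1)
```

Pages that share their heaviest tags end up close together, while a page with no tags in common is pushed far away.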

One point for the mathematicians & UberGeeks

  • Of course other types of distance are possible. The Euclidean metric is known as L2, as every term is raised to the second power and then the square root of the sum is taken. I suppose that different metrics (L1, Lmax, …) will give different neighbourhoods, and there is quite a lot to explore just by playing with various metrics and seeing what appears to be near what in each. I don’t exclude that people using tag sets can be seen as using a particular type of metric.
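For anyone who wants to play with the various metrics, here is a small sketch of the general Lp distance, with L1, L2 and Lmax as special cases. The function name and the example weights are invented, and a missing tag is again assumed to have weight zero.

```python
def minkowski_distance(a, b, p):
    """Lp distance between two {tag: weight} mappings; p = float('inf') gives Lmax."""
    diffs = [abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in set(a) | set(b)]
    if p == float("inf"):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Made-up weights for two documents.
doc1 = {"tag1": 0.5, "tag2": 0.3, "tag3": 0.2}
doc2 = {"tag2": 0.3, "tag3": 0.4, "tag4": 0.1}

l1 = minkowski_distance(doc1, doc2, 1)                # sum of absolute differences
l2 = minkowski_distance(doc1, doc2, 2)                # Euclidean distance
lmax = minkowski_distance(doc1, doc2, float("inf"))   # largest single difference
```

L1 counts every disagreement equally, while Lmax looks only at the single tag on which the two documents disagree most, so the same pair of pages can look near under one metric and far under another.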

Conclusions

Tag sets and tag clouds are different types of objects. Tag clouds can be approximated by a power law and, vice versa, a power law can be approximated by a tag cloud. A power law can be fully calculated once the ordered set of elements is given and the steepness is specified. Thus giving the ordered list of tags in a tag cloud and the steepness is (up to approximation errors) enough to have a general understanding of the tag cloud. Each tag cloud can also be expressed as a point in the n-dimensional hypercube. As a tag cloud grows it will tend to asymptotically approach a point in the hypercube, while the multiplicities of its tags will tend to approach a power law distribution. Thus when we pass the information about the power law distribution, we are implicitly also passing the information about the position of the tag cloud as a point in the n-dimensional hypercube. Since we can easily define a metric inside the n-dimensional hypercube, we also have a metric among tag clouds. Given a tag cloud, it is then just a matter of calculation to find its neighbouring ones. None of this is possible using tag sets.

General Bibliography

Not a standard bibliography, but most of the articles that inspired me to write this piece. This is just a short list, as the ideas written here have been inspired by many, many other articles found on the net.

Some key documents

Clay Shirky: Ontology is Overrated: Categories, Links, and Tags
Social Bookmarking Tools (I): A General Review
Vanderwal.net: Explaining and Showing Broad and Narrow Folksonomies
Clay Shirky: Power Laws, Weblogs, and Inequality

On how tags follow a power law:

P.S.: Hierarchical Delicious Free Mind Map
Peer Pressure: Bootstrapping the Semantic Web
Ascription is an Anathema to any Enthusiasm: Tagging Powerlaw

Blog Entries on Tag Clouds


TagSchema: Web 2.0 needs Data 2.0

The Daily Report: Tag clouds are the new mullets
The Daily Report: Remove Forebrain and Serve: Tag Clouds II

Example of a blog that uses Tag Clouds

Blog: playgroundblues
Tag Clouds: A Response

Later Links

This work seems to have inspired some further work.

Terrell Russell measures the relative weight of a tag cloud as a function of time, and indeed notices that it converges. A brief review here.

31 thoughts on “On Tag Clouds, Metric, Tag Sets and Power Laws”

  3. Michal Migurski

    I have a not-that-old powerbook, and the java applet on this page brought the entire machine to a grinding halt for several minutes. The Mac Java implementation isn’t so hot. Be kind, offer a link and a warning. :)

  7. Pietro

    Stéphane: I’ll be happy to help, but you have to do the coding; I’m overwhelmed with work at the moment. Can you point me to a further explanation of the idea? Which documents are you referring to?

    Zheng: I read your post, thanks to Google translate service (first, second.)
    Very interesting.

    Pietro

  22. Ben Hyde

    Very nice write up!

    One point I think it’s worth finding a niche for: consider your example page. It has lots and lots of tags that aren’t shown in the summary (presumably so the UI doesn’t become overwhelming). Tags like test, maps, words in foreign languages, etc. Given the power-law nature of this, a surprising proportion of the volume of the tag set is found in these. Then there are all those people who posted it who presumably had tags on the tip of their fingers yearning to break free. Here’s a posting that shows all the tags for a URL. I found it fascinating.

    Truncating the tail of power-law distributions has consequences.

  23. Pietro

    >Very nice write up!

    Thank you Ben, from you is worth double.

    Yes, I had to cut the long tail. But this was not because of space limitations, but because at the time I had no ready-made software to screen-scrape delicious. And I would agree that the whole info is extremely interesting. Especially if we want to calculate the distance between two URLs, the exact shape of the tail might be as important as the main tags. But this will depend upon the metric chosen.

    On the other hand I really think that it does not make much sense to just list all the tags, unordered (or ordered alphabetically). As time goes to infinity I suspect that each and every word might appear; each and every spelling error, too (del.iciou.s as well as delifious). But we shouldn’t think that those tags would all appear with the same frequency epsilon. They will be diversified, too. And this differentiation might give insights too, hence the importance of the long tail expressed before.

    But if we just list all the tags we would, at the limit, just list every alphanumeric string. All the info would be gone.

  25. asplake

    Following on from the last couple of comments, just how long can the tail be? We’re counting a finite number of discrete things so it’s not like the curve approaches zero asymptotically. Would it be useful perhaps to look at where the log/log line crosses the x-axis (i.e. where the tag multiplicity is 1), and what would it say about any tags that lie there?

  26. Pietro

    Thanks for commenting,

    If we consider, as I was suggesting, the weight of a tag as the number of people who have used the tag divided by the number of people who have bookmarked the URL, then the size (i.e. the weight) does indeed approach zero. Using the weight instead of the actual number is important to renormalise the positions of different URLs and make them comparable. If you have two URLs which have exactly the same ratios among the tags, but one has been bookmarked twice as many times as the other, the distance between their positions in the hypercube, computed on the weights, would be zero. Computed on the raw counts instead, the distance between the one bookmarked more and the one bookmarked less is bigger than zero, and exactly equal to the distance between the one bookmarked less and the origin of the coordinates.

    In other words: you have to use the weight (or other renormalisation techniques). It does make a fundamental difference.
