Follow @pietrosperoni (390 followers)

Related posts:

  1. COP15 Needs an e-Government System This morning I received a mail from Copenhagen. It was very moving, and describing a situation of chaos, strong commitment, and braveness. It told the story of people fighting with non violence, and shouting that they want change. And I am afraid all this is useless. I feel once again what I felt looking at Iran [...]...

Tag Clouds are hard to Spam

I think the time has come to write my third, and hopefully last, contribution to the topic of tagclouds.

I have been hearing a lot of talk about how users should not use too many tags when linking to a URL. I am also the maintainer of the mindmap maker, and I often look at some of the maps generated (available to everybody). A number of people tend to use an average of between one and two tags per URL. Their maps are often very ordered: no clustering, no hierarchy. (Forgive me if I don’t put a link to such a map, but since I am going to bash this way of using delicious, I’d rather bash a method than a specific human being. Just go to the list of maps and open a couple; odds are one of them will be of the type I am describing.) This way of using delicious treats tags as folders, with the one modification that every now and then you can put a URL in more than one folder at the same time. A bit like a big bookstore that carries several copies of the same book and stores them in more than one place (so the Tao Te Ching ends up in New Age, God knows why, and in Religion).

Of course tags tend not to fit exactly. My Tag Clouds and Cultural Change will be under Tags or Folksonomy or Sociology… Whatever you choose, you probably will not put it under Ajax. And yet most of the analysis was done by studying the spread of the term Ajax.

Let’s make a few simple calculations.
I think we can agree that when a search on delicious gives us more than 20 results it starts to be uncomfortable. We have n tags. Since we often use only one tag to store a URL, I assume that search is done by looking for the URLs under that specific tag, like you would look inside a folder in your bookmarks. In each tag we can store about 20 links, so the maximum number of URLs that we can store ends up being 20*n. Add to this that not all tags will be used with the same multiplicity: tags will (probably!) still follow a power law, and even if it is not a precise power law, some tags will be used far more than others, so we can lower this number to 10*n. In any case, something that grows linearly with the number of tags.
Let’s suppose instead that we use two tags for each URL, having n tags altogether. The total number of combinations will then be, hmm, mumble mumble mumble, (n*(n-1))/2.
(n*(n-1) is always an even number, because either n is even or (n-1) is even, so the division is exact.)
And that grows as the square of n. Much better.
But why stop there? What if you start using 3 tags?
Then we would have n(n-1)(n-2)/(3*2*1) = n(n-1)(n-2)/6
That grows nearly as the cube of n.
So, is this general? Can we always use more tags per URL to have more space?
Yes, it is part of a general rule; and no, the general rule does not just keep growing (unless the total number of tags that you want to be using is really going to be unbounded).
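
A quick sketch of the arithmetic above, assuming (as in the text) that about 20 links per search result is the comfortable limit. The function names are made up for illustration, and the two- and three-tag columns count only distinct tag combinations, not links per combination.

```python
from math import comb

LINKS_PER_RESULT = 20  # comfort limit assumed in the text


def capacity_single_tag(n: int) -> int:
    """One tag per URL: each of the n tags holds about 20 links."""
    return LINKS_PER_RESULT * n


def distinct_combinations(n: int, m: int) -> int:
    """Number of distinct sets of m tags chosen out of n."""
    return comb(n, m)


# Columns: n, 20*n, n(n-1)/2, n(n-1)(n-2)/6
for n in (10, 50, 100):
    print(n, capacity_single_tag(n),
          distinct_combinations(n, 2), distinct_combinations(n, 3))
```

With 100 tags the single-tag scheme tops out around 2000 links, while there are 4950 distinct pairs and 161700 distinct triples to file things under.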

Let’s rapidly look at the general rule.
If we remember that n! (n factorial) is equal to n*(n-1)*(n-2)*(n-3)*…*3*2*1,
and if we use m tags for each URL out of a total of n, the total number of possible URLs we can store will be:
n! / ((n-m)! * m!)
Every first-year student knows this, or should, as this is the m’th term of the n’th line of the Tartaglia Triangle (aka Pascal’s Triangle).


1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1

The number of URLs we can store will generally grow until the number of tags per URL reaches n/2, and then will start to decrease again. If we use n different tags (out of n) to store each URL we are essentially using only one possible combination.
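
The general rule and its peak can be checked with a few lines of Python. This is only a sketch: the function names are invented here, and the factorial formula is compared against the standard library's `math.comb` as a sanity check.

```python
from math import comb, factorial


def tag_combinations(n: int, m: int) -> int:
    """n! / ((n-m)! * m!): ways to choose m tags out of n, order ignored."""
    return factorial(n) // (factorial(n - m) * factorial(m))


def pascal_row(n: int) -> list[int]:
    """Row n of the Tartaglia (Pascal) triangle, counting rows from 0."""
    return [tag_combinations(n, m) for m in range(n + 1)]


def best_tags_per_url(n: int) -> int:
    """Smallest m that maximises the number of combinations."""
    row = pascal_row(n)
    return row.index(max(row))


assert pascal_row(7) == [comb(7, m) for m in range(8)]
print(pascal_row(6))               # [1, 6, 15, 20, 15, 6, 1]
print(best_tags_per_url(100))      # 50
print(tag_combinations(100, 100))  # 1
```

The last line is the degenerate case from the text: using all n tags on every URL leaves exactly one combination.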

So we know we don’t want to use more than half of the total number of tags to store URLs. But how are we then going to find those URLs? We are probably going to look for the intersection of those tags. Give me “Folksonomy” intersection “Sociology” in the example above. Only one of the two might give you too many links. On the other hand, it is very hard to remember where we stored a URL when we use only one tag. If we use a whole combination of m tags, remembering all of them will be even harder. But it is not that hard to remember a couple of those tags. Or three, if two still return too many results.

So the solution that I am suggesting is:
Use as many tags as fit. Make sure that you use a different combination for each URL (if two URLs have the same combination, find the difference between the two and use the difference to burn/carve a new tag). When you search, always use pairs of keywords, and add keywords if you get too many results.
And this is good, but we can probably do better. Still, this will give us more storage space than using a single term. So what I am actually suggesting is to use maybe five or ten tags for each URL (all the keywords that would fit!), and then search using 3 of those.
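
To make the search side concrete, here is a toy sketch of tag-intersection search. The class `TagIndex`, its methods and the example URLs are all hypothetical; this is not how delicious stores or searches bookmarks, only an illustration of "start with a pair of keywords and add more if there are too many hits".

```python
from collections import defaultdict

MAX_RESULTS = 20  # comfort limit assumed earlier in the post


class TagIndex:
    """Toy bookmark store: each URL is filed under a set of tags."""

    def __init__(self) -> None:
        self._urls_by_tag: dict[str, set[str]] = defaultdict(set)

    def add(self, url: str, tags: list[str]) -> None:
        for tag in tags:
            self._urls_by_tag[tag].add(url)

    def search(self, *tags: str) -> set[str]:
        """URLs that carry every one of the given tags (intersection)."""
        if not tags:
            return set()
        result = set(self._urls_by_tag.get(tags[0], set()))
        for tag in tags[1:]:
            result &= self._urls_by_tag.get(tag, set())
        return result

    def narrow(self, tags: list[str]) -> set[str]:
        """Start with two keywords; add more while there are too many hits."""
        used = 2
        hits = self.search(*tags[:used])
        while len(hits) > MAX_RESULTS and used < len(tags):
            used += 1
            hits = self.search(*tags[:used])
        return hits


index = TagIndex()
index.add("http://example.org/clouds", ["tags", "folksonomy", "sociology", "ajax"])
index.add("http://example.org/maps", ["tags", "folksonomy", "mindmap"])
print(index.search("folksonomy", "sociology"))
```

Only the first URL carries both "folksonomy" and "sociology", so the pair is enough to find it without remembering the full combination it was stored under.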

Tag Spam

But how is that different from tag spam? We have seen tag spam: people using a hundred tags for each URL as a way to increase visibility. Yet there are a number of differences.

  1. Their posts tend to all have the same tags, i.e. the tags they use don’t follow a power law with respect to the number of times they are used. Each is often used the same number of times as the others: the number of URLs they have stored.
  2. For the reason above, the tags are not useful to distinguish one URL from another.

In this a spammer is really different from someone who uses many tags.
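
The difference can be seen in a small sketch. The function names and sample data below are invented, and the measure (the share of a user's tags that sit on every one of their URLs) is just one simple way to illustrate the point, not a proposed spam filter.

```python
from collections import Counter


def tag_usage(posts: dict[str, set[str]]) -> Counter:
    """How many of a user's URLs carry each tag."""
    usage: Counter = Counter()
    for tags in posts.values():
        usage.update(tags)
    return usage


def flat_fraction(posts: dict[str, set[str]]) -> float:
    """Share of a user's tags that appear on every single one of their URLs."""
    usage = tag_usage(posts)
    if not usage:
        return 0.0
    on_every_url = sum(1 for count in usage.values() if count == len(posts))
    return on_every_url / len(usage)


# Hypothetical users: one repeats the same tags everywhere, one tags generously.
spammer = {
    f"http://spam.example/{i}": {"cheap", "free", "win", "best"} for i in range(5)
}
generous = {
    "http://example.org/a": {"tags", "folksonomy", "sociology"},
    "http://example.org/b": {"tags", "python", "tool"},
    "http://example.org/c": {"tags", "folksonomy", "mindmap", "tool"},
}
print(flat_fraction(spammer))   # 1.0: every tag is on every URL
print(flat_fraction(generous))  # only "tags" is on every URL
```

Note that a user with a single bookmark also scores 1.0, so the number only means something for users with several URLs.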

Spam will also have a deleterious effect on using a tag metric to explore the long tail. If a spammer tags a common URL with many improbable keywords, and then tags another URL with the same set of keywords, that might be enough to make the two URLs very close. The proof is left as a simple exercise to the reader. This is bad (although I am quite convinced that there is a lot of work that can be done to mitigate this effect by carefully choosing which metric to use).

Since their sets of tags don’t follow a power law, it might seem that we could use this in our fight against spam. It is not so. It would be quite easy for the spammer to change the algorithm to follow a power law exactly. What I am about to suggest is way more fundamental and revolutionary.

Let’s use tag clouds to store links instead of tag sets!
At the moment each person uses a tag set to tag a link. Each tag has the same importance as every other. Order is irrelevant. But we know that, left in the cultural wild, tags tend to develop into a very specific tag cloud where each tag has a very specific weight. Like the ingredients in grandma’s pudding, each URL will have that much of this tag and that much of that tag. If this is true, and if tags are not just discrete folders, where something either is inside or is not, but are dimensions in a continuous space (…at the limit you might say that the absence of a tag with respect to a URL is a privative and as such is impossible; many, many tags just have such a low weight as to be irrelevant, but let’s not get philosophical), then it really is limiting to let users store a URL only with a tag set instead of a tag cloud. We are starting to understand a natural way to store a huge number of things, and it will not be long before tools appear that let people bookmark a URL with a tag cloud.

But how can a URL be bookmarked with a tag cloud instead of a tag set? Isn’t it impractical? Think: it is already a pain in the ass to find all the tags that are relevant, and now you also want us to store next to each tag its weight:
“I think this is worth 0.5, this 0.4, this 0.9 and this 0.1. Oops, it doesn’t sum up to 1, I need to start again. Let’s put them all at 0.25. Oops, it does not fit a power law. Delicious’ new interface thinks it’s spam and has deleted the whole URL. Grr, damn Pietro!” :-)

We can do better than that.
We know that:

  • tag clouds approximate a power law;
  • the sum of the weights should be renormalised to one;
  • the steeper the power law, the shorter the list of tags above a certain weight;
  • a power law can be approximated by a list of tags and a steepness;
  • delicious does not consider the order of the tags.

And now, if we put it all together… mix, add some salt. We could have an interface that:

  • takes the list of tags, ordered (!);
  • by looking at the number of tags, decides the best steepness to use;
  • assigns to each tag a weight such that:
    • the tags ordered by weight end up being ordered as the user sent them,
    • the sum of the weights is one,
    • the tags’ weights follow a power law (as much as it makes sense in such a short list).
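
A minimal sketch of such an interface, under loud assumptions: the weight of the tag at rank r is taken proportional to r to the power of minus the steepness, and `steepness_for` is a purely hypothetical rule for picking the steepness from the list length (the post does not say how it should be chosen). Tags are assumed to be unique within a list.

```python
from typing import Optional


def steepness_for(count: int) -> float:
    """Hypothetical rule (not from the post): longer lists get a steeper curve."""
    return 1.0 + count / 10.0


def tag_cloud(ordered_tags: list[str],
              steepness: Optional[float] = None) -> dict[str, float]:
    """Give the tag at rank r (1-based) a weight proportional to r ** -steepness,
    then rescale so that the weights sum to one."""
    if not ordered_tags:
        return {}
    if steepness is None:
        steepness = steepness_for(len(ordered_tags))
    raw = [rank ** -steepness for rank in range(1, len(ordered_tags) + 1)]
    total = sum(raw)
    return {tag: weight / total for tag, weight in zip(ordered_tags, raw)}


cloud = tag_cloud(["python", "snake", "animal", "biology"])
for tag, weight in cloud.items():
    print(f"{tag}: {weight:.3f}")
```

The user only supplies the order; the weights fall out of it, already normalised and already decreasing along a power-law curve.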

And all that is needed from the user’s point of view is to make sure that the tags are ordered from the most fit for that URL to the least fit for that URL.

What would be the consequences of using tag clouds instead of tag sets?

  • First of all, we know that the general tag cloud of a URL is obtained by summing up the tag sets of all the users. This is still true: we can sum up the tag clouds of all the users and get a general tag cloud. Only, this time, all users will have the same weight. If there are u users who have tagged a URL, each user will be responsible for 1/u of the position of the result in the n-simplex. (Whereas now, people who use more than one tag generally weigh more on the final position.)
  • Even if someone uses 10000 tags, the extra noise he is adding will rapidly approach zero.
  • It will be impossible for users to add irrelevant tags to link together different articles using the tag metric described before (unless this happens as the joint effort of many users).
  • Search in tag space could use the extra information of the order of the tags, thus adding storage space.
    Roughly we go from
    n! / ((n-m)! * m!)
    to
    n! / (n-m)!
  • The inbox can be modified so that, instead of looking for one keyword, we look for a position in the n-simplex (i.e. the space of the normalised tag clouds), and only look for pages that appear within a distance d of that position. (This is important and not to be overlooked.) So I could look for (“delicious, tool”; d=0.5) to find all the delicious tools, etc.
  • General search will also work in a way similar to the inbox, just without the distance d. You give a position in the simplex (i.e. an ordered list of tags) and you receive a list of URLs in order of distance from that point. Notice that the idea of giving an ordered list of tags is not new. In Google, for example, the list of words you search for is ordered. Try switching the order of some of the terms in the list and see how the results change.
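
The first and the last two points can be sketched together. Everything here is illustrative: the names are invented, the clouds are written out by hand as tag-to-weight dictionaries summing to one, and plain Euclidean distance is only an assumed stand-in for whatever metric would actually be used.

```python
from collections import defaultdict
from math import sqrt
from typing import Optional

Cloud = dict[str, float]  # tag -> weight, weights summing to 1


def merge_clouds(user_clouds: list[Cloud]) -> Cloud:
    """Average the users' clouds: each of the u users contributes 1/u."""
    merged: Cloud = defaultdict(float)
    for cloud in user_clouds:
        for tag, weight in cloud.items():
            merged[tag] += weight / len(user_clouds)
    return dict(merged)


def distance(a: Cloud, b: Cloud) -> float:
    """Euclidean distance between two clouds (an assumed metric)."""
    return sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b)))


def search(query: Cloud, urls: dict[str, Cloud],
           d: Optional[float] = None) -> list[str]:
    """URLs sorted by distance from the query; if d is given, keep only
    those within distance d (the 'inbox' variant)."""
    ranked = sorted(urls, key=lambda url: distance(query, urls[url]))
    if d is None:
        return ranked
    return [url for url in ranked if distance(query, urls[url]) <= d]


urls = {
    "http://example.org/tool": merge_clouds([
        {"delicious": 0.6, "tool": 0.4},
        {"delicious": 0.5, "tool": 0.3, "mindmap": 0.2},
    ]),
    "http://example.org/snakes": merge_clouds([
        {"python": 0.5, "snake": 0.3, "animal": 0.2},
    ]),
}
query = {"delicious": 0.6, "tool": 0.4}
print(search(query, urls))         # nearest first
print(search(query, urls, d=0.5))  # only pages within distance 0.5
```

Because every user's cloud sums to one before averaging, the merged cloud also sums to one, so each URL ends up at a single point of the simplex.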
There is also another reason why the order of tags should be considered, to the point of treating it as a power-law tag cloud instead of just a set. At the moment there are many people who are noticing (correctly!) a serious difference between user tagging and author tagging. Well, I claim that this way of using tag clouds instead of tag sets will heal the gap by forcing authors to use a normalised tag cloud (weights summing to 1), with weights following a power law. And in such a space it is much harder to spam.

But why is it harder to spam? Because each document has a place in the simplex. Visibility will generally be higher when you occupy that place, because that is where people will look for you. Right now spammers tend to add keywords to be present in multiple places. But if the weights of the tags get renormalised, it will be like forcing each URL to be present in one and only one place of the general simplex. Trying to move to a better position might increase their visibility, but not their visibility toward the people who are actually looking for them.

All this of course requires some changes. In particular:

  • delicious should be modified to consider as relevant the order of the tags, use it to calculate the relative tag cloud, add that to the general tag cloud of a certain URL, and use that tag cloud in its calculations.
  • Technorati might do the same.
  • Users need to be alerted that the order of tags is important.

In the meantime we could play with cloudalicious by renormalising the weight of the tags that each user inserts, dividing it by the number of tags that that user has added.
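
A sketch of what that renormalisation would do, with invented function names and made-up data. Dividing each tag by the size of its user's tag set is what the text proposes; dividing again by the number of users, so that the result sums to one, is an extra assumption added here. Nothing below reflects how cloudalicious actually works.

```python
from collections import Counter


def raw_cloud(tag_sets: list[set[str]]) -> dict[str, float]:
    """Plain summing of tag sets: every tag a user adds counts as 1."""
    counts = Counter(tag for tags in tag_sets for tag in tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def renormalised_cloud(tag_sets: list[set[str]]) -> dict[str, float]:
    """Each tag from a user with k tags is worth 1/k, and every user
    gets the same overall say (assumed extra step: divide by user count)."""
    users = [tags for tags in tag_sets if tags]
    weights: dict[str, float] = {}
    for tags in users:
        for tag in tags:
            weights[tag] = weights.get(tag, 0.0) + 1.0 / len(tags) / len(users)
    return weights


# Nine ordinary users and one user who dumps 1000 tags on the same URL.
ordinary = [{"python", "language"} for _ in range(9)]
dumper = {"python"} | {f"junk{i}" for i in range(999)}
tag_sets = ordinary + [dumper]

raw = raw_cloud(tag_sets)
fair = renormalised_cloud(tag_sets)
print(f"raw share of 'python':          {raw['python']:.4f}")
print(f"renormalised share of 'python': {fair['python']:.4f}")
```

In the raw cloud the 1000-tag user supplies almost all of the weight; after renormalising, that user counts for one tenth like everybody else, and each junk tag is close to invisible.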

OK, hopefully in the next days I will clean up this entry a bit and make it nicer. And then I’ll be off, or so I hope. I am going on vacation. This summer I might go to the European Go conference in Prague, and maybe to the first conference on Wikipedia. If you plan to go to either of them let me know, and we might have something to drink together.

17 comments to Tag Clouds are hard to Spam

  • Great post. What I find particularly interesting is that, on first reading, I vehemently disagreed with this:

    delicious should be modified to consider as relevant the order of the tags

    Because (I thought) the chances are that anything I’ve tagged with ‘folksonomy’ will also be tagged with ‘ethnoclassification’ and ‘taxonomy’ and ‘metadata’ – not to mention ‘tagging’, ‘tags’, ‘tag.cloud’ etc – and assuming that I’d entered the tags in a meaningful order would be inappropriate.

    But here are some of my actual del.icio.us tag combinations:

    tag.web tagging tags taxonomy folksonomy ethnoclassification
    tag.cloud tagging classification.facet ethnoclassification taxonomy tagging.social
    semantic_web tagging ontology taxonomy metadata
    tagging taxonomy metadata ontology semantic_web search
    tagging classification ethnoclassification folksonomy semantic_web
    tagging classification ethnoclassification folksonomy tag.cloud popularity long_tail

    And so on. Hardly any two URLs have exactly the same combination of tags, & in general the tags are in descending order of importance.

    I think this could work. Maybe a del.irio.us hack for somebody?

  • Though not a rigorous proof, the analysis sounds convincing. Interesting thinking.

  • Hello Phil, the relation between logical hierarchy (personal ontology) and your tagclouds is in no way a simple one. And we are finally starting to break free from the chains of hierarchy that were stopping our mental process. Notice for example how people might differ vehemently on the more abstract tags, but the more concrete a tag is, the more it will be shared by many people. This should tell us something. For example, that abstract tags are less important, not more important, than the more concrete ones. In the power law, abstract tags are on the lower part, in the tail. The concrete are probably more on the right. So, maybe it would make sense to follow this suggestion even when we order our tags from the most important to the least important:
    python snake animal biology
    instead of
    biology animal snake python.

    After all, if you are trying to divide the space equally, it seems that
    python snake animal biology
    viper snake animal biology
    python language IT
    Cpp language IT

    do spread in the space more evenly than

    biology animal snake python
    biology animal snake viper
    IT language python
    IT language Cpp

    But then, maybe, when you look for a document similar to a document on the language python you want to find documents about Cpp, more than documents about vipers. Which shows that it’s all to be proven that the metric I suggested a couple of posts ago is actually useful. Hmmm,

    Yes, I think hacking something on delirious might be a good training place, just to test practically the ideas.

    Blog֮¼¡:
    Hello, “Blog֮¼¡”, I so wish I was able to understand your post. Translated by Google it seems interesting, but too many parts of it are incomprehensible in the automatic translation.

  • Thank you for your work
    I admire the fluidity of your explanation and I am ‘flabbergasted’ (I love this word) by the implications of tag clouds. I have spent a great time reading you and will soon post something about it on my blog http://darjeelink.com.
    We are a social software company in France, new on this Web2.0 scene and still trying to figure out what it’s all about and how we can live on it. Your article gave me a lot to ponder for marketing studies and knowledge management
    Thank you
    Alexis Perrier
    by the way what is the Go European Conf in Prague ?

  • Thanks Alexis,
    I am looking forward to seeing your articles and what use you will make of those ideas.

    The European Go conference, oops should have been European Go Congress is the biggest meeting in Europe for Go players, which is held every year in a different city.
    I used to use the tag newnet on my delicious page for what is now called web2.0. But in the last months so many documents started being about web2.0 that I forgot to use it.

  • good, hope we can meet there. :)

  • Impressive comment. Very very interesting and thank you for posting in my blog. You are right it is possible to spam in tagging urls but not in tag clouds. It is very interesting how the mathematics underneath goes from permutation without repetitions (n! / ((n-m)! * m!)) to permutations: n! / (n-m)!

    By the way

  • sorry, dumb fingers.

    Thank you for the comment.

    Another hint..the way to reach a Juventus 3 – Milan 3 follows the same rule that you have found: permutations with repetitions, probably you will know but it is a pleasure for me to remember my old friend mathematics.

  • what a good article, that’s helpful for our tags site!

  • Interesting, So, have you actually implemented it?
    Up to now I have been very disappointed by how everybody keeps focusing on tag-sets, more than on tag-clouds.

    Could you give us more details?

    Pietro

  • We are using tags as a new thing. Tags have benefits but there is a potential for loss too. As tagging is a new thing in blogging, the search engines are just viewing the results, but if in future people use tags for spam or to increase site rank, then the search engines may take action against sites using too many tags. I think moderate use of tagging is OK, but using 100 tags on one post is unfair.

  • Profile: TagCloud

    Service: TagCloud

    Launched: June 2005
    Location: Bellwood, PA
    Status: Corporate name is IonZoft
    What is it?
    TagCloud is a service that generates a “tagcloud” (see below) based on provided URLs or feeds. A tagcloud is basically a grou…

  • More on tags, mindmaps and folksonomy

    Pietro Speroni: Delimind
    Tags have become so popular, they have become an object of scientific study. The new term folksonomy describes “a practice of collaborative categorization using freely chosen keywords”. It’s called folkson…

  • [...] I also fixed a small feature in wordpress which from my POV has become a bug. Most of you have noticed that I assign many categories to the same post. I use them as tags, and I have explained before how I think we should be generous in tagging. Well, wordpress categories weren’t really designed for this, and although they changed the code a bit in Wordpress 2.0 to give the possibility to create new categories as you assign them, they were really not considering having people using hundreds of categories. Which is not that big a number if you consider them as tags, and you consider that you might easily reach a vocabulary of a few thousand keywords. So every time I would open the “edit post” page, I would only receive some categories. This was not only annoying, it was also very unfortunate, because I had to redefine all the categories again, or when I were to save the post, all the categories that did not appear were taken off. Very annoying. I asked for help in the wordpress forum, but no one came to my rescue. Then one morning, playing with mysql, I noticed that some categories had a parent which was not present anymore, and they would eventually not appear. Once I fixed that, the number of categories appearing in the edit post page had risen, but was still not complete. I counted them: 100. A nice round number that sounds like someone’s decision. I went to the code. Played around, looked, poked, and finally I found the following line: function return_categories_list($parent = 0) { global $wpdb; return $wpdb->get_col("SELECT cat_ID FROM $wpdb->categories WHERE category_parent = $parent ORDER BY category_count DESC LIMIT 100"); } [...]

  • [...] “if Pietro was right, and Dave was right (and I was right about how they fit together), does that mean Shelley [...]

  • [...] conclusions. If ‘cloudiness’ is a universal condition, del.icio.us and flickr and tag clouds and so forth don’t enable us to do anything new; what they are giving us is a live [...]
