I went on programming at my favourite Python program: Delimind.
In short: Made a new release of the Deli Mind program. Here is the source code (just remember to change it from a .txt to a .py). Now similar tags are clustered together.
- Here is how it looks like.
- Here is how the previous version looked like.
- The original from Brownhen (may he live long and prosper) used to be here, although now it is missing.
All on the same data. Mine, now.
Go and enjoy.
(Later addition: while the program works well for small databases of links, like mine at the time in which I wrote this entry, it doesn’t scale well on size. For this reason it crashes for most of the people who try to use it with more than 1000 bookmarks. For this reason I was forced to change the link on the cluster example to a database with fewer nodes.)
Now the tecnical stuff for those that have a bit more patience.
Tags are not all the same, some are more similar than others. So, for example, the tag “September11″ and “GeorgeBush” have more links in common than “GeorgeBush” and “intelligence”. The idea behind this version of DeliMind was to cluster tags that had links in common. Since distance is generally not a transitive property (if I am near to you, and you are near to Jim, I am not necessarily that near to Jim), while clustering is (if I and you are in the same cluster, and you and Jim are in the same cluster, then me and Jim have to be in the same cluster… unless people belong to different clusters, but that’s a complication).
So I started by making a matrix of relations among tags (all_dict). Each tag, respect to each other tag could either be
- Once contained in the other
- Identical
- Disjointed
- With # bookmarks in common
Then according to the number of links each of the two tags, and the number of links in common I invented a measure of similarity. If #A is the number of links in tag A, and #B is the number of links in tag B, and #AB is the number of links in common.
The the relative similarity (SAB) will be:
SAB= sqrt((#AB/#A)*(#AB/#B))
I actually played with various measures:
SAB= ((#AB/#A)+(#AB/#B))/2
SAB= Max(#AB/#A,#AB/#B)
They all went from 0 to 1, and were quite similar… (I am not going to discuss the relative properties)
But the first one just seemed the one that made more sense, and at the end, the resulting map was the one more close to my personal intuition of what should be in what cluster.
Once the similarity matrix was done I started studying the clusters. Generally for each triplet of tags A, B, C I would modify
SAC:=min (previous SAC, max (SAB, SBC))
And I would continue going through all possible triplets, and then starting again from the beginning until no new change were happening.
Why? The idea is that the similarity between two tags measure how easy it is to jump from one to the other. Visualise each tag as an island, and then you have an animal who can jump from one island to the other. But it can only jump up to a certain distance. So if he can find a succession of tags between two tags, A and B, where the similarity (the similarity is the inverse of the distance) is always above its jumping ability (that is, the distance is below its jumping ability), then the animal can move from A to B. If not A and B are in different clusters. Effectively unreachable.
But we don’t know how far can our beast jump. So in this way we end up having a similarity number that sais: somwhere, between A and B is possible to find a succession of tags, such that the distance is never above x, so SAB is equal to the minimum between the original SAB and x.
If it does feel complicated don’t worry. I got confused a few (hundred) times programming it. And just could not understand why those damn tags were not clustering… until I got it right.
So, now you have this nice matrix, only between your main tags (the one that are not contained in another tag, cfr previous version), and you (or actually I) need to cluster the tags.
Not also that you don’t need to cluster the tags only one time. Once you made a clustering (for animal which can jump d), you can still partition inside the clustering for animals that can jump less than d.
The first time I just asked him to cluster each possible number. That is, if a number was present assume that someone was able to jump exactly that distance. In this way I got a heavily clustered map. It was a mess, but a promising mess. I then saw that most of the interestign things were happening between distances of 0.333333 and 0.6666.
That is, it made quite sense to ask for the clusters generated by putting together tags that had one third of the links in common, and tags that had up to two third of the links in common.
This is how I got clusters:
- porno, sex and eros
- GeorgeBush, September11, politics, economy, historical, terrorism, usa
- green, sustainability
- …

Then I just applied the same process in the subtags of each tag.
Ok, I can be satisfied, I can go and have something to eat.
As always, if you find it useful drop me a line, I appreciate.
Pietro
Recent Comments