Partial translations

Have you ever tried the Google Translate service? I know: if you did, you wish you hadn’t, unless you were bored and looking for some way to amuse yourself. But you know, translating text is a really daunting task. Generations of PhDs have been spent advancing the state of the art just a little bit each time. I know what I am talking about: I lived with some of them in COGS, at Sussex University. I remember reading somewhere that new, better automatic translators will soon be available. Good! We are waiting for them.

In the meantime…

I had this idea:
Have you ever tried to translate a page from a language you don’t know quite well, but are not totally ignorant of either? Something in between. Here in Europe that is quite common. And the same is true when I read posts in Portuguese, or in American from people on the other side of the ocean.

Yes, I can try to use Google’s translation mechanism, but it doesn’t give me something easy to chew. Look at this post, for example:

Depois do high vem o low. É uma lei do universo.
E no low todo mundo é feio e o mundo é triste e é tudo um saco.

E eu já nem sei o que me move.
From here

Google translates it as:

es low.? a law of the universe. E in low everybody? ugly and the world? sad e? everything a bag.

E I j? nor I know what it moves me.

From my darling Alenahra.

I suppose a better translation would be:

After a high comes a low. It is a law of the universe. And in a low everybody is ugly and the world is sad and everything is empty.

And I still don’t know what it is that moves me.

And Ale’ will tell me if I got it right.

My idea is that Google, instead of providing a tentative answer, should provide all the possible translations for each word. Those translated words should appear when we point to a word with the mouse. I know it is a slow way of reading a document, one word at a time, but soon the reader will pick up the most common words and will speed up.

What follows is an example. Move the mouse over the words to see the title appear. I used some simple translations that I could find. Obviously the tool I envision would have to be more professional.

Depois do high vem o low. É uma lei do universo. E no low todo mundo é feio e o mundo é triste e é tudo um saco.

E eu nem sei o que me move.
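
For the curious, here is a minimal sketch of how such hover titles could be generated. It is not the markup used for the example in this post. The word list `GLOSSES` and the function `annotate` are hypothetical names, and the handful of glosses are only there for illustration. Each known word is wrapped in a `<span>` whose `title` attribute lists its possible translations, which browsers show as a tooltip.

```python
from html import escape

# Hypothetical word list: a few glosses for illustration only.
GLOSSES = {
    "depois": ["after", "afterwards"],
    "vem": ["comes"],
    "lei": ["law"],
    "universo": ["universe"],
}


def annotate(text, glosses):
    """Wrap each known word in a <span> whose title lists its translations."""
    out = []
    for token in text.split():
        # Look the word up without trailing punctuation and in lower case.
        key = token.strip(".,;:!?").lower()
        if key in glosses:
            title = escape(", ".join(glosses[key]), quote=True)
            out.append('<span title="%s">%s</span>' % (title, escape(token)))
        else:
            # Unknown words are passed through untouched (only HTML-escaped).
            out.append(escape(token))
    return " ".join(out)


print(annotate("Depois do high vem o low.", GLOSSES))
```

Words missing from the list are simply left alone, so the reader still sees the whole sentence and can fill the gaps from context.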

In Italy right now more and more people are getting comfortable with English. If you had come here only 10 years ago, most people would have refused to even try to speak English, even if they had studied it in school. Now, I believe thanks to the internet, people are reading English pages daily, the dictionary often in a corner of the desk, ready to be used. It would be helpful for them to have such a system.

And I would finally learn Portuguese!

Porto Alegre, aspettami!

Special thanks to travlang.com for providing part of the translations.

Clustering Delicious Tags

I kept working on my favourite Python program: DeliMind.

In short: I made a new release of the DeliMind program. Here is the source code (just remember to rename it from .txt to .py). Now similar tags are clustered together.

  1. Here is how it looks.
  2. Here is how the previous version looked.
  3. The original from Brownhen (may he live long and prosper) used to be here, although now it is missing.

All on the same data. Mine, now.
Go and enjoy.
(Later addition: while the program works well for small databases of links, like mine at the time I wrote this entry, it doesn’t scale well. It crashes for most of the people who try to use it with more than 1000 bookmarks. For this reason I was forced to change the link on the cluster example to a database with fewer nodes.)

Now the technical stuff for those who have a bit more patience.

Tags are not all the same; some are more similar than others. So, for example, the tags “September11” and “GeorgeBush” have more links in common than “GeorgeBush” and “intelligence”. The idea behind this version of DeliMind was to cluster tags that had links in common. The catch is that distance is generally not a transitive property (if I am near to you, and you are near to Jim, I am not necessarily that near to Jim), while clustering is (if you and I are in the same cluster, and you and Jim are in the same cluster, then Jim and I have to be in the same cluster… unless people can belong to more than one cluster, but that’s a complication).

So I started by making a matrix of relations among tags (all_dict). Each tag, with respect to each other tag, could be:

  1. One contained in the other
  2. Identical
  3. Disjoint
  4. With # bookmarks in common
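
As a rough sketch of such a matrix (not the actual DeliMind code): the name `all_dict` comes from the program, but its layout as a dictionary of dictionaries, the helper names and the sample bookmarks here are all assumptions made for illustration.

```python
from itertools import combinations


def links_by_tag(bookmarks):
    """Invert {url: [tags]} into {tag: set of urls}."""
    index = {}
    for url, tags in bookmarks.items():
        for tag in tags:
            index.setdefault(tag, set()).add(url)
    return index


def relation(links_a, links_b):
    """Return 'identical', 'contained', 'disjoint' or the number of shared links."""
    common = len(links_a & links_b)
    if links_a == links_b:
        return "identical"
    if common == len(links_a) or common == len(links_b):
        return "contained"  # one tag's links all appear under the other tag
    if common == 0:
        return "disjoint"
    return common


def build_all_dict(index):
    """Assumed layout: all_dict[tag_a][tag_b] holds the relation of the pair."""
    all_dict = {tag: {} for tag in index}
    for a, b in combinations(index, 2):
        rel = relation(index[a], index[b])
        all_dict[a][b] = rel
        all_dict[b][a] = rel
    return all_dict


# Made-up bookmarks, only to exercise the four cases.
bookmarks = {
    "http://example.org/1": ["python", "programming"],
    "http://example.org/2": ["programming", "web"],
    "http://example.org/3": ["web", "delicious", "del.icio.us"],
    "http://example.org/4": ["politics"],
}
all_dict = build_all_dict(links_by_tag(bookmarks))
print(all_dict["python"]["programming"])  # contained
print(all_dict["programming"]["web"])     # 1
```

Comparing every pair of tags grows with the square of the number of tags, which is fine for a few hundred tags but gets slow beyond that.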

Then, according to the number of links in each of the two tags and the number of links in common, I invented a measure of similarity. If #A is the number of links in tag A, #B is the number of links in tag B, and #AB is the number of links in common, then the relative similarity (SAB) will be:
SAB = sqrt((#AB/#A)*(#AB/#B))

I actually played with various measures:
SAB= ((#AB/#A)+(#AB/#B))/2
SAB= Max(#AB/#A,#AB/#B)
They all went from 0 to 1, and were quite similar… (I am not going to discuss their relative properties.)
But the first one just seemed to be the one that made the most sense, and in the end the resulting map was the one closest to my personal intuition of what should be in which cluster.
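
To compare the three measures on concrete numbers, here is a small sketch. The function names are made up for this example, and so are the counts: tag A has 8 links, tag B has 2, and they share both of B's links.

```python
from math import sqrt


def geometric(n_a, n_b, n_ab):
    """sqrt((#AB/#A) * (#AB/#B)): the measure that was finally chosen."""
    return sqrt((n_ab / n_a) * (n_ab / n_b))


def arithmetic(n_a, n_b, n_ab):
    """((#AB/#A) + (#AB/#B)) / 2"""
    return ((n_ab / n_a) + (n_ab / n_b)) / 2


def largest(n_a, n_b, n_ab):
    """Max(#AB/#A, #AB/#B)"""
    return max(n_ab / n_a, n_ab / n_b)


# Made-up counts: #A = 8, #B = 2, #AB = 2 (B is entirely inside A).
for measure in (geometric, arithmetic, largest):
    print(measure.__name__, measure(8, 2, 2))
# geometric 0.5
# arithmetic 0.625
# largest 1.0
```

On this example the maximum already reports full similarity because the small tag sits inside the big one, while the geometric mean stays lower because most of A's links are not shared.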

Once the similarity matrix was done I started studying the clusters. For each triplet of tags A, B, C I would modify
SAC := min(previous SAC, max(SAB, SBC))
and I would continue going through all possible triplets, then start again from the beginning, until no new changes happened.
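
A minimal sketch of this repeated triplet update follows. It is not the DeliMind code, and it rests on one loud assumption: the matrix is taken to hold distances (the smaller, the closer), since a min/max update makes two tags closer only when read that way. How similarity is turned into distance is not fixed here; the numbers are made up, and `minimax_closure` is a hypothetical name.

```python
from itertools import permutations


def minimax_closure(dist):
    """Repeat D[a][c] = min(D[a][c], max(D[a][b], D[b][c])) until nothing changes.

    `dist` is assumed to be a complete, symmetric dict of dicts of distances.
    """
    tags = list(dist)
    closed = {a: dict(dist[a]) for a in tags}  # work on a copy
    changed = True
    while changed:
        changed = False
        for a, b, c in permutations(tags, 3):
            # Longest single jump needed when going from a to c through b.
            via_b = max(closed[a][b], closed[b][c])
            if via_b < closed[a][c]:
                closed[a][c] = via_b
                changed = True
    return closed


# Made-up distances: A is close to B, B is close to C, A is far from C.
dist = {
    "A": {"B": 0.2, "C": 0.9},
    "B": {"A": 0.2, "C": 0.3},
    "C": {"A": 0.9, "B": 0.3},
}
closed = minimax_closure(dist)
print(closed["A"]["C"])  # 0.3
```

After the update, the entry for A and C is the longest single jump on the easiest route between them, here the 0.3 hop from B to C.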

Why? The idea is that the similarity between two tags measures how easy it is to jump from one to the other. Visualise each tag as an island, and then imagine an animal that can jump from one island to another. But it can only jump up to a certain distance. So if it can find a succession of tags between two tags, A and B, where the similarity (the similarity is the inverse of the distance) is always high enough for its jumping ability (that is, the distance is below its jumping ability), then the animal can move from A to B. If not, A and B are in different clusters. Effectively unreachable.

But we don’t know how far our beast can jump. So in this way we end up having a similarity number that says: somewhere between A and B it is possible to find a succession of tags such that the distance is never above x, so SAB is equal to the minimum between the original SAB and x.

If it feels complicated, don’t worry. I got confused a few (hundred) times programming it, and just could not understand why those damn tags were not clustering… until I got it right.

So, now you have this nice matrix, only between your main tags (the ones that are not contained in another tag, cf. the previous version), and you (or actually I) need to cluster the tags.

Note also that you don’t need to cluster the tags only once. Once you have made a clustering (for an animal which can jump d), you can still partition inside each cluster for animals that can jump less than d.
The first time I just asked it to cluster at each possible number. That is, if a number was present, assume that someone was able to jump exactly that distance. In this way I got a heavily clustered map. It was a mess, but a promising mess. I then saw that most of the interesting things were happening between distances of 0.333333 and 0.6666.

That is, it made quite a lot of sense to ask for the clusters generated by putting together tags that had one third of the links in common, and tags that had up to two thirds of the links in common.
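
Here is a sketch of what clustering at two thresholds can look like. It is an illustration, not the program's code: tags are joined whenever a chain of pairs with similarity at or above the threshold connects them, the function name `clusters` is hypothetical, and the similarity values are invented.

```python
def clusters(sim, threshold):
    """Group tags linked by a chain of pairs with similarity >= threshold."""
    seen, result = set(), []
    for start in sim:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            tag = stack.pop()
            if tag in group:
                continue
            group.add(tag)
            # Follow every jump that is easy enough at this threshold.
            stack.extend(t for t, s in sim[tag].items()
                         if s >= threshold and t not in group)
        seen |= group
        result.append(sorted(group))
    return result


# Invented similarities between a few tags.
pairs = {
    ("GeorgeBush", "September11"): 0.7,
    ("September11", "terrorism"): 0.5,
    ("GeorgeBush", "terrorism"): 0.2,
    ("green", "sustainability"): 0.4,
}
sim = {}
for (a, b), s in pairs.items():
    sim.setdefault(a, {})[b] = s
    sim.setdefault(b, {})[a] = s

print(clusters(sim, 1 / 3))
# [['GeorgeBush', 'September11', 'terrorism'], ['green', 'sustainability']]
print(clusters(sim, 2 / 3))
# [['GeorgeBush', 'September11'], ['terrorism'], ['green'], ['sustainability']]
```

The stricter threshold only splits the clusters found at the looser one, which is what makes the nested map possible.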

This is how I got clusters:

  • porno, sex and eros
  • GeorgeBush, September11, politics, economy, historical, terrorism, usa
  • green, sustainability

Example of the Clustered Map
Then I just applied the same process to the subtags of each tag.

Ok, I can be satisfied, I can go and have something to eat.

As always, if you find it useful drop me a line, I appreciate it.

Pietro

Hierarchical Delicious Free Mind Map

So, I just modified the deli.mind script, originally from brownhen.
The original would take the public bookmarks from delicious and make a free mind map out of them.

(For those who have no time to read the whole post, I will tell you immediately that I modified the code. The new code can be found here, and an example is here; open some nodes to see the difference!)

The program is written in Python, and I wasn’t very happy with the result. I mean, it was great to have the map, but I have so many tags that it was pretty much useless. The fact is that we tend to reuse tags that we have already used. This generates a positive feedback dynamic that tends to create a bunch of very common tags (even among your own tags) and many, many tags used only once or twice. I bet you could also plot them into a nice power law picture (but, alas, you need at least 1000 tags to make it statistically meaningful!). This is generally true, but it is particularly true for people who, like me, tend to store each link with around 10 different tags. This means that the long list of tags that was using up my screen was mainly composed of completely unimportant tags, with only a few interesting ones among them.

Not only this, but some tags tend to appear only in conjunction with other tags. For example, the tag “python” always comes with the tag “programming”. In a sense it is a “sub tag”.

Oops, we are back into hierarchy, aren’t we?

Well, not exactly. First, the same link can be present in different, non-hierarchically related tags; second, two tags can have links in common but not be completely hierarchically related (think about the tags ‘September11’ and ‘GeorgeBush’ as a good example). The last thing to note is that from time to time there are tags which have exactly the same links inside, either because they are synonymous (‘del.icio.us’ and ‘delicious’, for example) or because I had not stored enough links to differentiate between the two.

So the new program extracts the information about the relation among the tags, and uses it to build a more interesting mind map.

More precisely, two tags can be:

  • Identical,
  • One inside the other,
  • Vice versa,
  • With a non-empty intersection, but with some extra links,
  • Completely disjoint.

This information is then used to create the new mind map.

With the following novelties:

  • Sub tags are shown as a sub branch of their parent tag.
  • Tags that are equivalent are shown together with a little empty branch as their parent, to connect them all.
  • A sub tag can be sub tag to more than one tag.
  • Each tag is also followed by two numbers: # of links & # of sub tags.
    So you have an idea of how big the tree you are going to explore is.
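
The bookkeeping behind these novelties can be sketched roughly as follows. This is an illustration under assumptions, not the released script: the input layout (tag mapped to a set of links), the function name `summarize` and the sample data are all hypothetical. A tag counts as a sub tag of another when its links are a strict subset of the other's, so indirect descendants are counted too.

```python
def summarize(index):
    """Return ({tag: (number of links, sorted sub tags)}, groups of identical tags)."""
    summary = {}
    for tag, links in index.items():
        # `<` on sets is the strict-subset test.
        subs = sorted(t for t in index if index[t] < links)
        summary[tag] = (len(links), subs)
    groups = {}
    for tag, links in index.items():
        groups.setdefault(frozenset(links), []).append(tag)
    equivalent = [sorted(g) for g in groups.values() if len(g) > 1]
    return summary, equivalent


# Made-up data: "python" sits under two tags, two tags are identical.
index = {
    "programming": {"u1", "u2"},
    "web": {"u2", "u3", "u4"},
    "python": {"u2"},
    "delicious": {"u4"},
    "del.icio.us": {"u4"},
}
summary, equivalent = summarize(index)
for tag, (n_links, subs) in summary.items():
    print("%s (%d links, %d sub tags)" % (tag, n_links, len(subs)))
print(equivalent)  # [['del.icio.us', 'delicious']]
```

In this data "python" shows up under both "programming" and "web", matching the point that a sub tag can belong to more than one parent.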

Detail of a tag and its sub tags
You can see my “hierarchical delicious free mind map” in Java format here, while the code is here.

I also fixed a couple of bugs that would give some fake results (e.g. being tagged as ‘socialsoftware’ does not mean being tagged as ‘war’, etc.).

This isn’t the end, I am planning to work on this some more, when I have time.

Disclaimer: This was also my first tentative hack in Python. So I am sure I did plenty of things in a clumsy, slow and redundant way. But I am learning.

Acknowledgment: I am very grateful to brownhen, because if he hadn’t released the first version of the script I would not have started at all.