Extracting Semantic Data from Wikipedia Infoboxes: DBpedia

One quite amazing thing about Wikipedia is the huge amount of information it provides. At the time of writing, the Wikipedia articles data dump was 7.8 GB compressed and 34.8 GB uncompressed, and mind you, all that without any images. So much data, and available for free too. But Wikipedia has one problem: because of the mechanical way entries get added, scraping it for any data is a serious pain in the ass (at least for me).

Then the other day I came across DBpedia, an effort to make the data in Wikipedia infoboxes available through a free, queryable interface. The data is in RDF format and you can write semantic queries, which are more like questions asked of Wikipedia. It gets better: the query results can easily be exported as XML or JSON. At first glance this might seem completely trivial, and in a way it is. But the good news is that Wikipedia is expanding and DBpedia is getting better. Imagine a world where one day non-programmers become content producers rather than consumers. That's what I believe DBpedia will do to the internet, and to top it off there are amazing tools like Exhibit which make visualizing all that data very easy and fun.
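To give a feel for what "asking Wikipedia a question" looks like, here is a minimal Python sketch. It assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and the standard SPARQL JSON results layout. The example query (the population of Berlin) and the helper names build_query_url, extract_values and run_query are my own illustrations, not anything official, and the property used may not be filled in for every resource.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public DBpedia SPARQL endpoint (assumed to be reachable from your machine).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative question: "what is the population of Berlin?"
EXAMPLE_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?population WHERE {
  dbr:Berlin dbo:populationTotal ?population .
}
"""


def build_query_url(query, result_format="application/sparql-results+json"):
    """Encode a SPARQL query as a GET request URL that asks for JSON results."""
    params = urlencode({"query": query, "format": result_format})
    return DBPEDIA_ENDPOINT + "?" + params


def extract_values(sparql_json, variable):
    """Pull the plain values of one variable out of a SPARQL JSON result."""
    bindings = sparql_json["results"]["bindings"]
    return [row[variable]["value"] for row in bindings if variable in row]


def run_query(query):
    """Send the query over HTTP and return the decoded JSON (needs network)."""
    with urlopen(build_query_url(query)) as response:
        return json.load(response)
```

With network access, something like extract_values(run_query(EXAMPLE_QUERY), "population") should hand back the matching values as strings, ready to be fed into whatever you want to build on top.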


Why Riak might not be the DB for you

Riak at first glance seems like a database of the future sent down from the heavens: highly available, easily scalable and completely fault tolerant. We used Riak for a semi-relational application (our biggest mistake), and these are a few of the problems we faced.

– Riak isn't your ordinary database, it's a key-value store, and it is extremely hard to write even a simple user management system with it. For example, lookups are possible only by the primary key within a bucket.

– Riak kept crashing for no apparent reason, with the logs showing nothing at all. With the app in staging, the database crashed quite often with CPU usage at 100%. We had hosted it on a 12 GB machine, so resources weren't the problem.

– Although the instructions for creating a cluster of nodes were easy to follow, adding new nodes once the cluster had some data in it was almost impossible. The data did not replicate properly and the nodes kept crashing.

– Riak had an interface for indexing data and searching the index through a Solr-like interface. One of our critical features relied on this. The search returned the right results on the development and staging machines, but the Riak setup on the beta machine had a mind of its own and never returned the right results.

– Riak's logging system is horribly bad. Example debug message: {:fun => pre_commit}. Someone tell me which language this is.
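To make the first point above concrete, here is a minimal Python sketch of what user management looks like when all you have is get and put by key. KeyValueBucket is a toy in-memory stand-in that I made up for illustration; it is not the Riak client API, and UserStore is just as hypothetical. The thing to notice is that finding a user by email only works because the application maintains a second email-to-id bucket by hand.

```python
class KeyValueBucket:
    """Toy in-memory stand-in for a key-value bucket (NOT the Riak client API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        # The only lookup a plain key-value store offers: by exact key.
        return self._objects.get(key)


class UserStore:
    """Hypothetical user management built on nothing but get/put."""

    def __init__(self):
        self.users = KeyValueBucket()
        # Hand-maintained secondary index: email -> user id.
        # A relational database would do this for us with an indexed column.
        self.email_index = KeyValueBucket()

    def add_user(self, user_id, name, email):
        self.users.put(user_id, {"name": name, "email": email})
        # Two separate writes that the application must keep consistent.
        self.email_index.put(email, user_id)

    def find_by_id(self, user_id):
        return self.users.get(user_id)

    def find_by_email(self, email):
        user_id = self.email_index.get(email)
        return None if user_id is None else self.users.get(user_id)
```

Every extra way of finding a user (by name, by signup date, by role) means one more index bucket and one more write to keep in sync, which is roughly where a simple feature stops being simple.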


The mistake we made

We cannot blame Riak completely for our failure. We made the huge mistake of using Riak as a relational database. It would probably have been ideal in a situation where huge amounts of data were involved and queries involving map-reduce were needed.



What you need to consider

This isn't a rant on Riak (except maybe for the crashes and the logging). There is so much praise for Riak out there that one bit of mud-slinging will not change it. The message for you is this: be completely aware of why you want to use Riak in your application, rather than just trying out a hip new database.

After weeks of sleepless nights we finally moved to a different database (name withheld to prevent controversy), and we have not lost any sleep for the past few weeks.

To the abode of the gods

Very often in life we talk about doing that one special thing so that we can strike it off our bucket list: quitting your job to start that company, going on a world tour, or something crazy along those lines. But it so happens that you only talk of doing these things and never really end up doing them. For me it has always been going to the Himalayas. Right from childhood my imagination had been captured by all the mythology, the literature and of course the images. I had never really worried about when I would visit the hills (an apt romantic reference), until the day I got this call and a voice from the other end said, “We are going to the Himalayas, are you coming?” “I need some time to think.” “Tell us within four days, we have to book the tickets.” A cold feet attack followed, and I was conjuring up all the reasons why I would not be able to join the trip. Leave from office? Money? Fitness? This seemed so unreal. Was I really going to do it? Even with all doubts finally laid to rest, the trip still seemed unreal. I had never imagined I would strike this off so early.

I really want to write more about the trip. But having rewritten this post a hundred times, I could not come up with anything that would do justice to it. On the next trip I am going to carry a book with me to capture the emotions as they are experienced. In case you landed here expecting something, have a look at this Facebook album.