Wikipedia has several million articles as of today and it is an excellent source for both structured and unstructured data. The Egnlish Wikipedia gets around 30K requests per second to its 3.5 million articles which is contributed by over 150 million active users. This post tries to bring out some of my experiences in mining Wikipedia.
Wikipedia has infoboxes that are sources of structured information. The screenshot attached is an example of an infobox. One can create a quizzing application using the structured infobox data in Wikipedia. The whole of Wikipedia is huge. It is difficult to download and process the whole of it. A small section for example, cricket, can be selected and all the articles from under the category can be downloaded by recursively crawling all the sub-categories and articles in the Wikipedia article. There are several libraries that can be used for the same. In python, a library called beautiful soup can achieve the same.
The code here prints a category tree. A category page like this has a list of sub-categories, expandable bullets and a list of pages. The function printTree is called on each article. It lists and downloads the ‘emptyBullets’ (leaf-categories) and ‘fullBullets’ (expandable-categories) and recursively calls the function for the sub-category. It avoids re-downloading duplicate pages. The getBullets function uses the Beautiful soup library to get all ‘<a>’ tags in the HTML article which have tags ‘CategoryTreeEmptyBullet’ and ‘CategoryTreeBullet’, the other functions seem simple enough to comprehend. The outcome of this function is the Category tree in Wikipedia. The function ‘download’ which uses wget can be replaced with a better download function that probably uses curl or some python HTTP library.
I have attempted to create a simple rule based ‘Named Entity Recognition’ program to classify these articles into people / organizations / tours and so on. This python file does the same. Here is a named-entity tagged version of the category.
Using the list of categories, we can get the list of articles in each category using this file, and the articles can be downloaded using this file. The code is pretty simple and I am mentioning these files here for the sake of completion. The body(article content) can be extracted using this file.
This file takes in an article and extracts the information in the infobox and outputs them in a structured format. I have written custom processors for processing infoboxes of people and infoboxes of people and matches and so on.
These outputs are in my own structure but they can be standardized to some format for example XML/JSON/RDF. A script can be used to do the same.