Skip to content

Wikipedia Offline

August 6, 2011

Most of what my work online either involves checking mail or browsing forums for getting answers or reading Wikipedia for getting information or social networking. With LAN cuts introduced in the IITs, it is difficult for a student to access information after 12:10 unless they breakout somehow. In an earlier post, I had explained with references to my code, on how to download parts of Wikipedia, I thought it would be helpful to download the whole of Wikipedia on to your computer. In this post I will show you how Wikipedia / stack-overflow / gmail can be download for offline use.

Wikipedia

Requirements:

  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 30 GB of space in primary partition 30 GB of space for storage. In my case the root partition
  • 7 GB of free Internet download
  • 3 days of free time

Wikipedia dumps can be downloaded from the Wikipedia site in XML format compressed in .7zip. This is around 6 GBs when compressed and expands to around 25GB of XML pages. It doesn’t include any images. This page shows how one can extract text articles from articles and construct corpuses from the same. Apart from this, a static HTML dump can also be downloaded from Wikipedia page (wikipedia-en-html.tar.7z) and this version has images in it. The compressed version is at 15 GB and it expands to over 200 GB because of all the images.

The Static HTML dump can simply be extracted to get all the HTML files and the required HTML file can be opened to view the required content. In case you download the XML dump, there is more – you have to extract the articles and create your customized offline Wikipedia.with the following steps.

  1. Download the latest mediawiki and install it on your Linux/Windows machine using LAMP/WAMP/XAMPP. Mediawiki is the software that renders Wikipedia articles using the data stored in MySQL.
  2. Mediawiki needs a few extensions which have been installed in Wikipedia.Once we have mediawiki installed say /var/www/wiki/, download each of them and install by extracting these extensions in the /var/www/wiki/extensions directory.
    The following extensions have to be installed – CategoryTree, CharInsert, Cite, ImageMap, InputBox, ParserFunctions (very important), Poem, randomSelection, SyntaxHighlight-GeSHi, timeline, wikihero which can all be found in the Mediawiki extensions download page by following the instructions. In addition you can install any template to make your wiki look like whatever you want. Now your own Wiki is ready for you to use, you can add your own articles but what we want now is to copy the original Wikipedia articles to our Wiki.
  3. It is easy to import all the data once and then construct an index for the data in MySQL than to update the index each time an article is added. Open MySQL and your database, the tables that are used in the import are text, page and revs. You can delete all the indexes on that page and create it again in the 5th step to speed up the process.
  4. Now that we have our XML database, we need to import it into the MySQL database. You can find the instructions here. In short, a summary of the instructions found on that page, the ONLY WAY you can get Wikipedia really fast on your computer is to use mwdumpertool to import into the database. The inbuilt tool in mediawiki won’t work fast and may run for several days. The following command can be used to import the dump into the database within an hour.
    java -jar mwdumper.jar --format=sql:1.5 <dump_name> | mysql -u <username> -p <databasename>
  5.  Recreate the indexes on the tables ‘page’, ‘revs’ and ‘text’ and you are done.

You can comment if you want to try the same or if you run into any problems while trying.

Stack-overflow

Requirements

  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 15 GB of space in the primary partition and 15 GB of storage. In my case the root partition
  • 4 GB of free Internet download

media10.simplex.tv/content/xtendx/stu/stackoverflow has several stackoverflow zip files available for direct download. Alternatively, stack-overflow dumps can be downloaded using a torrent. A torrent download can be converted into an FTP download using http://www.torrific.com/home/. Once you have the dumps you can unpack them to get huge XML files for several stack sites. Stack-Overflow is one of the stack sites, the 7zip file is broken into 4 parts and have to be combined using a command (cat a.xml b.xml c.xml d.xml > full.xml) Once combined and extracted, we can see 6 xml files for each site (badges, comments, postHistory, posts, users, votes, ) Among these, comments, posts and votes may seem useful for offline usage of the forum. A main post may consist of several reply posts and each such post may have follow-up comments. Votes are used to rate an answer and they can be used as signals while you browse through questions. Follow the following steps to import the data into the database and use the UI to browse posts offline.

  • Download Stack sites
  • Create a database StackOverflow with the schema using the description here. (comments, posts and votes tables are enough)
  • Use the code to import the data to the database. (Suitably modify the variables serveraddress, port, username, password, databasename, rowspercommit, filePath and site in the code)
  • Run the code on Stack Mathematics to import the mathematics site. For bigger sites, it may take much more time and a lot of optimizations are needed along with a lot of disk space in the primary partition where the MySQL stores its databases.
  • Use the UI php files to view a post given the post number along with the comments and replies.
  • TODO: Additionally we can add a search engine that searches the table ‘posts’ for queries and returns post numbers which match the same.
Gmail offline
Requirements:
  • Windows / Mac prefered
  • Firefox prefered
  • 20 minutes for setup
  • 1 hour for download
Gmail allows offline usage of mails, chats, calendar data and contacts. You can follow the following simple series of steps to get gmail on your computer.
  • Install Google gears for firefox
    • You can install google gears from the site http://gears.google.com
    • If you are on Linux, you can install gears package. [sudo apt-get install xul-ext-gears]
    • Note: Gears works well in Windows, may fail on Linux
  • Login to gmail
  • Create new label “offline-download”
  • Create a filter {[subject contains: “Chat with”] or [from: <user-name>] -> add label “offline-download” to selectively download your conversations.
  • Enable offline Gmail in settings, and allow download “offline-downloa” for 5 years. You can select the period of time as well.
  • Start, it will end in around an hour and you will have your mails on your computer in an hour.
Offline gmail creates a database called [emailID]@gmail.com#database in your computer. The gears site gives you the location. You can find some information about offline GMail here.
If you want a custom interface for your mails / chats etc, you can create one which queries the SQLITE database mentioned above to present the content however you want. The software diarymaker can be used to read your chat data with plots of frequencies with time and rank your friends based on the interactivity. It works on Linux and uses the Qt platform. I will add a post on it soon.
Feel free to comment on any issue, if you have an idea for downloading any other kind of data on to your computer for offline usage, please let us know with a comment.
Update:
media10.simplex.tv/content/xtendx/stu/stackoverflow Now you can download stackoverflow directly. (Courtesy: Sagar Borkar)
Advertisements
2 Comments leave one →
  1. December 21, 2011 8:49 am

    I was reading about the gmail offline part in this post. The only reason one would want to use gmail offline, would be to have that interface. If it comes to another interface, all the mails can be downloaded into thunderbird, right?

    • December 26, 2011 2:55 pm

      Yes, its only for the interface, not to forget the superior search capability.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: