
Rhyming words using python

June 15, 2013

Here is some simple Python code that uses the NLTK CMU pronouncing dictionary to find words that rhyme with a given word at a given level (the number of trailing phonemes that must match).

import nltk

# requires the CMU pronouncing dictionary: nltk.download('cmudict')
def rhyme(inp, level):
    # (word, pronunciation) pairs from the CMU pronouncing dictionary
    entries = nltk.corpus.cmudict.entries()
    # pronunciations of the input word (a word may have several)
    syllables = [(word, syl) for word, syl in entries if word == inp]
    rhymes = []
    for (word, syllable) in syllables:
        # words whose last `level` phonemes match those of the input word
        rhymes += [word for word, pron in entries if pron[-level:] == syllable[-level:]]
    return set(rhymes)

print "word?"
word = raw_input()
print "level?"
level = input()
print rhyme(word, level)

Wikipedia Offline

August 6, 2011

Most of my work online involves checking mail, browsing forums for answers, reading Wikipedia for information, or social networking. With LAN cuts introduced in the IITs, it is difficult for a student to access information after 12:10 unless they break out somehow. In an earlier post I explained, with references to my code, how to download parts of Wikipedia; I thought it would be helpful to download the whole of Wikipedia onto your computer. In this post I will show how Wikipedia, Stack Overflow and Gmail can be downloaded for offline use.

Wikipedia

Requirements:

  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 30 GB of space in the primary partition (the root partition in my case) and another 30 GB for storage
  • 7 GB of free Internet download
  • 3 days of free time

Wikipedia dumps can be downloaded from the Wikipedia site in XML format, compressed with 7zip. The dump is around 6 GB compressed and expands to around 25 GB of XML pages; it doesn’t include any images. This page shows how one can extract plain-text articles from the dump and construct corpora from them. Apart from this, a static HTML dump can also be downloaded from the Wikipedia dumps page (wikipedia-en-html.tar.7z), and this version includes images. The compressed version is about 15 GB and expands to over 200 GB because of all the images.

The static HTML dump can simply be extracted to get all the HTML files, and the required file can be opened to view the content. If you download the XML dump, there is more to do – you have to extract the articles and build your customized offline Wikipedia with the following steps.

  1. Download the latest MediaWiki and install it on your Linux/Windows machine using LAMP/WAMP/XAMPP. MediaWiki is the software that renders Wikipedia articles using the data stored in MySQL.
  2. MediaWiki needs a few of the extensions that are installed on Wikipedia. Once MediaWiki is installed, say at /var/www/wiki/, download each of them and install it by extracting it into the /var/www/wiki/extensions directory.
    The following extensions have to be installed – CategoryTree, CharInsert, Cite, ImageMap, InputBox, ParserFunctions (very important), Poem, RandomSelection, SyntaxHighlight GeSHi, Timeline and WikiHiero – all of which can be found on the MediaWiki extensions download page; follow the instructions there. In addition, you can install any template to make your wiki look however you want. Now your own wiki is ready to use and you can add your own articles, but what we want is to copy the original Wikipedia articles into it.
  3. It is easier to import all the data first and then build the indexes in MySQL than to update the indexes each time an article is added. Open MySQL and your database; the tables used in the import are text, page and revision. You can drop the indexes on these tables now and recreate them in step 5 to speed up the process.
  4. Now that we have our XML dump, we need to import it into the MySQL database. You can find the instructions here. To summarize that page: the only way to get Wikipedia onto your computer really fast is to use the mwdumper tool to import the dump into the database. The built-in importer in MediaWiki is slow and may run for several days, while the following command can do the import within an hour or so (a scripted version of this pipeline is sketched after this list).
    java -jar mwdumper.jar --format=sql:1.5 <dump_name> | mysql -u <username> -p <databasename>
  5. Recreate the indexes on the tables ‘page’, ‘revision’ and ‘text’ and you are done.
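
For reference, here is a minimal Python sketch of the same import pipeline using the subprocess module. The dump file name, database name and user below are placeholders for whatever you used in the steps above.

import subprocess

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder dump file name
DB = "wikidb"                                  # placeholder database name
USER = "wikiuser"                              # placeholder MySQL user

# mwdumper converts the XML dump into SQL INSERT statements,
# which are piped straight into the mysql client (as in the command above)
dumper = subprocess.Popen(
    ["java", "-jar", "mwdumper.jar", "--format=sql:1.5", DUMP],
    stdout=subprocess.PIPE)
importer = subprocess.Popen(
    ["mysql", "-u", USER, "-p", DB],
    stdin=dumper.stdout)
dumper.stdout.close()  # so mwdumper gets SIGPIPE if mysql exits early
importer.communicate()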

You can comment if you want to try the same or if you run into any problems while trying.

Stack-overflow

Requirements

  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 15 GB of space in the primary partition (the root partition in my case) and 15 GB for storage
  • 4 GB of free Internet download

media10.simplex.tv/content/xtendx/stu/stackoverflow has several Stack Overflow zip files available for direct download. Alternatively, the Stack Exchange dumps can be downloaded using a torrent, and a torrent download can be converted into an FTP download using http://www.torrific.com/home/. Once you have the dumps, you can unpack them to get huge XML files for the various Stack sites. Stack Overflow is one of the Stack sites; its 7zip archive is broken into 4 parts which have to be combined with a command (cat a.xml b.xml c.xml d.xml > full.xml). Once combined and extracted, there are 6 XML files for each site (badges, comments, postHistory, posts, users, votes). Among these, comments, posts and votes are the useful ones for offline usage of the forum. A main post may have several reply posts, and each such post may have follow-up comments; votes are used to rate an answer and can be used as signals while you browse through questions. Follow these steps to import the data into the database and use the UI to browse posts offline.

  • Download Stack sites
  • Create a database StackOverflow with the schema using the description here. (comments, posts and votes tables are enough)
  • Use the code to import the data into the database, suitably modifying the variables serveraddress, port, username, password, databasename, rowspercommit, filePath and site in the code. (A rough sketch of such an importer appears after this list.)
  • Run the code on Stack Mathematics to import the mathematics site. Bigger sites take much more time and need a lot of optimization, along with a lot of disk space in the primary partition where MySQL stores its databases.
  • Use the UI php files to view a post given the post number along with the comments and replies.
  • TODO: Additionally we can add a search engine that searches the table ‘posts’ for queries and returns post numbers which match the same.
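
As a starting point, here is a rough Python sketch of such an importer (this is not the code linked above): it streams posts.xml with ElementTree and inserts a few of the row attributes into MySQL via the MySQLdb driver. The connection details, table and column names are placeholders – match them to the schema you created.

import xml.etree.cElementTree as ET
import MySQLdb

# placeholder connection details and batch size (cf. rowspercommit above)
conn = MySQLdb.connect(host="localhost", user="root", passwd="secret", db="StackOverflow")
cur = conn.cursor()
ROWS_PER_COMMIT = 10000

count = 0
# iterparse streams the file, so the multi-GB dump never has to fit in memory
for event, elem in ET.iterparse("posts.xml"):
    if elem.tag == "row":
        cur.execute(
            "INSERT INTO posts (id, posttypeid, parentid, score, title, body) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (elem.get("Id"), elem.get("PostTypeId"), elem.get("ParentId"),
             elem.get("Score"), elem.get("Title"), elem.get("Body")))
        count += 1
        if count % ROWS_PER_COMMIT == 0:
            conn.commit()
        elem.clear()  # free memory for rows we have already processed
conn.commit()
conn.close()
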
Gmail offline
Requirements:
  • Windows / Mac preferred
  • Firefox preferred
  • 20 minutes for setup
  • 1 hour for download
Gmail allows offline usage of mails, chats, calendar data and contacts. You can follow this simple series of steps to get Gmail on your computer.
  • Install Google Gears for Firefox
    • You can install Google Gears from the site http://gears.google.com
    • If you are on Linux, you can install the gears package. [sudo apt-get install xul-ext-gears]
    • Note: Gears works well on Windows but may fail on Linux
  • Log in to Gmail
  • Create a new label “offline-download”
  • Create a filter ([subject contains: “Chat with”] or [from: <user-name>]) -> add label “offline-download”, to selectively download your conversations.
  • Enable offline Gmail in settings, and allow download of the “offline-download” label for 5 years (you can select a different period as well).
  • Start the download; it will finish in around an hour and you will have your mails on your computer.
Offline Gmail creates a database called [emailID]@gmail.com#database on your computer; the Gears site gives you its location. You can find some information about offline Gmail here.
If you want a custom interface for your mails / chats etc., you can create one that queries the SQLite database mentioned above and presents the content however you want. The software diarymaker can be used to read your chat data, plot frequencies over time and rank your friends based on interactivity. It works on Linux and uses the Qt platform. I will add a post on it soon.
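
If you want to explore that database, a tiny Python sketch like the following is a reasonable starting point. The path is a placeholder, and rather than assuming table names, the script simply lists the tables and their columns so you can decide what your interface should query.

import sqlite3

# placeholder path – use the location the Gears site reports for
# [emailID]@gmail.com#database on your machine
db_path = "youraddress@gmail.com#database"

conn = sqlite3.connect(db_path)
cur = conn.cursor()

# list the tables Gears created, then the columns of each one
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
for (table,) in cur.fetchall():
    cur.execute('PRAGMA table_info("%s")' % table)
    columns = [row[1] for row in cur.fetchall()]
    print table, ":", ", ".join(columns)

conn.close()
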
Feel free to comment on any issue. If you have an idea for downloading any other kind of data onto your computer for offline usage, please let us know with a comment.
Update:
Now you can download the Stack Overflow dumps directly from media10.simplex.tv/content/xtendx/stu/stackoverflow. (Courtesy: Sagar Borkar)

Wikipedia Mining

August 6, 2011

Wikipedia has several million articles as of today and it is an excellent source of both structured and unstructured data. The English Wikipedia gets around 30K requests per second to its 3.5 million articles, which are contributed by over 150 million active users. This post tries to bring out some of my experiences in mining Wikipedia.

Infobox example

Infobox in a Wikipedia article

Wikipedia has infoboxes, which are sources of structured information; the attached screenshot is an example. One could, for instance, build a quizzing application using the structured infobox data in Wikipedia. The whole of Wikipedia is huge and difficult to download and process, but a small section, for example cricket, can be selected and all the articles under that category downloaded by recursively crawling its sub-categories and articles. There are several libraries that can be used for this; in Python, a library called Beautiful Soup does the job.

The code here prints a category tree. A category page like this one has a list of sub-categories, expandable bullets and a list of pages. The function printTree is called on each category page; it lists and downloads the ’emptyBullets’ (leaf categories) and ‘fullBullets’ (expandable categories) and recursively calls itself on each sub-category, avoiding re-downloading duplicate pages. The getBullets function uses the Beautiful Soup library to get all ‘<a>’ tags in the HTML page that carry the classes ‘CategoryTreeEmptyBullet’ and ‘CategoryTreeBullet’; the other functions are simple enough to follow. The outcome is the category tree of Wikipedia. The function ‘download’, which uses wget, can be replaced with a better download function based on curl or some Python HTTP library.
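
A rough sketch of that crawler (not the exact linked code) looks like this. The category URL is just an example, and the class names follow the description above, so they may need adjusting to the current Wikipedia markup:

import urllib2
from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org"
seen = set()  # avoid re-downloading duplicate category pages

def get_bullets(soup):
    # leaf categories vs. expandable categories, as described above
    empty = soup.find_all("a", class_="CategoryTreeEmptyBullet")
    full = soup.find_all("a", class_="CategoryTreeBullet")
    return empty, full

def fetch(path):
    request = urllib2.Request(BASE + path, headers={"User-Agent": "category-tree-crawler"})
    return urllib2.urlopen(request).read()

def print_tree(path, depth=0):
    if path in seen:
        return
    seen.add(path)
    print " " * depth + path
    soup = BeautifulSoup(fetch(path), "html.parser")
    empty, full = get_bullets(soup)
    for link in empty:          # leaf categories: just print them
        print " " * (depth + 2) + link.get("href", "")
    for link in full:           # expandable categories: recurse
        if link.get("href"):
            print_tree(link["href"], depth + 2)

print_tree("/wiki/Category:Cricket")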

I have attempted to create a simple rule-based ‘Named Entity Recognition’ program to classify these articles into people / organizations / tours and so on. This Python file does that. Here is a named-entity tagged version of the category.

Using the list of categories, we can get the list of articles in each category using this file, and the articles can be downloaded using this file. The code is pretty simple and I am mentioning these files here for the sake of completeness. The body (article content) can be extracted using this file.

This file takes in an article, extracts the information in the infobox and outputs it in a structured format. I have written custom processors for infoboxes of people, infoboxes of matches and so on.

Here are examples of final outputs. (Article body of Sachin Tendulkar, Infobox extracted from article Sachin Tendulkar, an infobox on the Indian tour of England in 2002.)

These outputs are in my own structure, but they can be standardized to a format such as XML/JSON/RDF; a small script can be used to do that.
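
As an illustration, a much simplified version of that kind of script – pulling the fields out of an infobox in an article's wikitext and emitting JSON – could look like this (real infoboxes contain nested templates and links that need more careful handling):

import json
import re

def parse_infobox(wikitext):
    # grab the first {{Infobox ...}} template in the wikitext
    match = re.search(r"\{\{Infobox(.*?)\n\}\}", wikitext, re.S | re.I)
    if not match:
        return {}
    fields = {}
    # simplified: assumes one "| key = value" field per line
    for line in match.group(1).splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

# a tiny made-up infobox fragment, just to show the output format
sample = """{{Infobox cricketer
| name = Sachin Tendulkar
| country = India
| batting = Right-handed
}}"""

print json.dumps(parse_infobox(sample), indent=2)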

Google_Intern

May 11, 2010

In another 6 days, my summer internship program at Google will begin. I have been waiting for over 7 months for this to happen. Google booked a flight for me to get to Bangalore and a cab to take me home from the airport. I am expecting the best from Google. So how did I get here? Since this is my tech blog and not the philosophical one, this post is about how I got my internship at Google.

I faced 3 rounds of telephonic interviews over a span of 3 weeks, involving questions on algorithms and problem solving. This was my second interview experience; the first, with Microsoft, did not go that well – I RG-ed myself, in IITM lingo (caused my own failure). Learning from my mistakes, I was well prepared for the second one. One thing I learnt from my mistakes was that I had not spoken much about myself, so this time I thought of everything I had done and planned what to tell the interviewer about myself. I had been SPOJing out of interest and addiction, and that did help me a lot.

My first interview lasted around 50 minutes. I spent the first 15 minutes introducing myself. Then started the programming round, where I had to write a piece of code for the atof function. Although the task seems simple – converting an error-free string to a floating point number – the actual work involved writing error-free code, tracing it like a machine and analysing the function to find the exact number of multiply/divide operations. My initial code took 2*N steps and I had to reduce it to N multiplications using simple optimisation principles. The next question involved mapping an arbitrarily sized string to a number, which is quite trivial as it's just a conversion of bases. The final question involved generating all possible subsets of a given set. A binary tree with recursion or a bit-vector iteration from 0 to 2^N are simple ways of generating all subsets of a set of size N. The interviewers were happy and I was selected for the next round, scheduled a week and a half later.
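
For the record, the bit-vector way of generating subsets that I mentioned is only a few lines (a Python sketch, not what I actually wrote in the interview):

def all_subsets(items):
    # each mask from 0 to 2^N - 1 is a bit-vector saying which
    # elements of `items` belong to the corresponding subset
    n = len(items)
    return [[items[i] for i in range(n) if mask & (1 << i)]
            for mask in range(1 << n)]

print all_subsets(['a', 'b', 'c'])  # 8 subsets, from [] to ['a', 'b', 'c']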

The difficulty level of the questions increased with each round, which is clear considering the simplicity of the questions in the first round. The second round was interesting. The algorithmic question asked me to find the total number of ways in which 2*N people sitting around a round table can shake hands so that no hands cross. I wasn't aware of this problem, and it turned out to have some interesting results, as I discovered during the problem-solving session. I quickly suggested a recursive approach where we divide the table, recursively calculate the value for the two halves, multiply them and sum over all possible divisions of the table; the image in this link should make it clear. This was done in the first few minutes and I suggested pseudo-code for it. The interviewer then asked me to make it more efficient and gave me a hint, asking how many times the value f(10) was calculated while calculating f(20). That suggested I had to use dynamic programming and memoisation. I suggested a version that maintains an array ‘F’ and remembers the outputs of the function ‘f’, and further suggested an iterative alternative to the recursion where f(n) is calculated after all f(i), i < n are calculated. The interviewer gave a last hint, through which I realised that I was calculating nothing but the Catalan numbers (a simple recurrence relation). The rest of the interview involved some simple questions on the C++ language, right out of the textbook (private constructors, static functions, object method invocation after destruction …). The second interview was over in just half an hour.
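
A Python sketch of that iterative handshake solution (which is just the Catalan recurrence):

def handshakes(pairs):
    # f[n] = number of non-crossing handshake arrangements for 2n people,
    # computed iteratively so every f[i] is evaluated exactly once
    f = [0] * (pairs + 1)
    f[0] = 1
    for n in range(1, pairs + 1):
        # fixing one person's partner splits the table into two
        # independent halves with k and n-1-k pairs
        f[n] = sum(f[k] * f[n - 1 - k] for k in range(n))
    return f[pairs]

print [handshakes(n) for n in range(6)]  # [1, 1, 2, 5, 14, 42] - the Catalan numbers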

The final interview was scheduled a couple of days after the second. The first question was a graph-theoretic one where I had to find the maximum number of edges in a DAG (the answer is nC2). One direction is just a proof by example, and the other is proving that only nC2 unordered pairs (a, b) can be used as edges. The next question was the first one where I actually wrote code. I had to write a program to find the number of 1’s, S(n), in the series (0, 1, 10, 11, 100, 101 … B(n)) where B(n) is the binary representation of n. E.g. S(5) is the number of 1’s in (0, 1, 10, 11, 100, 101), so S(5) = 7.

I remembered something that I learnt in my combinatorics class, which suggested that S(2^k – 1) is easy to calculate. S(2^k – 1) = k * 2^(k-1) is a very easy formula to derive using summation, and I did that in the first 5 minutes.

000  001
010  011
100  101
110  111

S(7) = 3 * 2^(3-1) = 12

I had to write code that worked for any n. This involved extracting the first bit of the number and processing the rest of it. I had a vague recurrence relation in my mind and I spent over half an hour trying to ‘write’ code that worked for all the end cases. There was a small bug in my code which I could not fix in time, and the interviewer moved on to the next question. I could have typed the code into my computer, executed it and tested it, instead of manually tracing the program on a piece of paper.
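
A sketch of the approach (not my original code): peel off the highest set bit of n and recurse on the remainder, using the S(2^k - 1) = k * 2^(k-1) formula for the complete block below that bit. This runs in time linear in the number of bits of n.

def count_ones_upto(n):
    # S(n): total number of 1 bits in the binary representations of 0..n
    if n <= 0:
        return 0
    k = n.bit_length() - 1     # highest set bit, so 2^k <= n < 2^(k+1)
    top = 1 << k
    # 0 .. 2^k - 1 contributes k * 2^(k-1) ones (the formula above);
    # 2^k .. n each carry a leading 1 (n - top + 1 of them), and their
    # lower bits contribute S(n - 2^k)
    return k * (top >> 1) + (n - top + 1) + count_ones_upto(n - top)

print count_ones_upto(5)  # 7, matching the example above
print count_ones_upto(7)  # 12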

The last question involved suggesting a strategy for Google to suggest alternate search results in case the user makes an error – the “Did you mean …” part of a Google results page like this one. I suggested naive strategies like flipping characters in the query to see if better search results are obtained. That idea is clearly not scalable because of the sheer number of ways the user might have made the error. I went on to suggest using the keyboard key placement and restricting the character flipping to just the neighbouring keys. This again wasn’t good enough; the interviewer wanted a better approach. I thought for a while and suggested a directed graph structure over queries, with a directed edge from one query to another if there is a high probability of that suggestion, and edge weights related to the probability of error. The closest few neighbours of a given node would list the most likely queries, and a simple BFS could reveal them. The problem now was the assignment of weights to the edges, for which I had to suggest an algorithm. I thought along the lines of people querying for something, not being happy with the results, and immediately changing the query to get a good response; the weight of the edge between query1 and query2 could be inversely proportional to the number of times that happens. The interviewer was pleased with the approach. I then requested a few minutes to complete the earlier program, fixed the bug and submitted the code, and it worked great in logarithmic time (linear in the number of bits of n).

In a couple of weeks, the HR called me up and asked me to send my certificates and informed me that I had been selected for internship at Google. 😀

Update: http://picasaweb.google.com/photos.jobs/IndiaOfficePhotos# has a collection of images that describe the environment in Google. Looks awesome!

exebit

October 15, 2009

This is a post publicizing exebit, the computer science department festival of IIT Madras. It will be conducted in February next year. As I am one of the cores for the event, the web-ops core, I am writing this post.

header_black

The first edition was held this February and there was a decent turnout for all the talks and events. This year it is expected to be much bigger.

Here is a list of the events that will be held at exebit.

Online events

Onsite events
Huge cash prizes will be given away for all the events. The descriptions of the events can be found in the links. For more information about exebit, visit http://www.exebit.org
Last year it was sponsored by big names like nvidia, mac, maples, Ericsson, cadence … The events last year included
Online events
Onsite events
Hope exebit is going to be a big success this time. The cores for exebit are striving hard for that.

Coding@Shaastra2009

October 8, 2009

I am back after 2 months. This post is about the various coding events at Shaastra 2009. There were 6 of them; I took part in 5 out of the 6, as I could not participate in Search 3.0.

First on the list is Frugal, a competition that was supposed to require minimal knowledge of coding and algorithms. The participants were expected to solve problems like finding the square root of a long long integer without using multiplication, division or bit shifting. Quite easy, a n00b would say, but doing it 10000 times in 1 second is something no n00b can achieve. After a whole day of coding I managed to submit 3 problems successfully. I attempted 2 more but made small errors that I couldn’t fix in time. I finished in 9th position – sadly, no consolation prizes :(.

Here is some code that I wrote – no divisions or multiplications, and it calculates the square root of a long int.
The code was supposed to be as short as possible, use as few keywords, semicolons and commas as possible, include only stdio.h and, worst of all, be written only in C.

frugal
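
The basic idea of avoiding multiplication and division can be illustrated with the identity 1 + 3 + 5 + ... + (2r - 1) = r^2: keep adding odd numbers until the running total would exceed n. This naive Python version is far too slow for the contest's 10000-queries-in-a-second limit, but it shows the trick:

def isqrt_additions_only(n):
    # floor(sqrt(n)) using only addition and comparison
    total, odd, root = 0, 1, 0
    while total + odd <= n:
        total += odd   # total is always root^2
        odd += 2
        root += 1
    return root

print isqrt_additions_only(144)  # 12
print isqrt_additions_only(150)  # 12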

I was one of the coordinators for HackFest, a non-competitive hands-on event and one of the biggest successes of this Shaastra. On the first day of Shaastra it had two talks, one by Atul Chitnis on FOSS and technology and the other by Shreyas Shrinivasan on FOSS Foundry. Both talks were targeted at a general audience with little knowledge of the open source world and were meant to kick-start people ambitious to get into the world of FOSS.

In the evening were lightning talks by the pro-hackers at HackFest. The organizations that were invited included the Linux kernel, Firefox, KDE, GNOME, Sugar and FFmpeg. At night, people hacked on the software they had chosen. The hacking night was scheduled to start at 10:00, but we started at 9:30 and taught people about XChat and IRC. A lot of unregistered participants turned up and had to be allotted to the organizations that were free.

The three kernel hackers were the most enthusiastic people in HackFest. They taught the students from the basics and gave out free T-shirts to people who answered questions. They retained their enthusiasm till 5:00 in the morning, consuming just one espresso shot somewhere around midnight.

Firefox hacking was led by sid0 from IITK; he used Jetpack to help people create their own add-on that checks for unread messages in Gmail. GNOME and KDE hacking brought up some potential contributors, and around 5 patches were submitted to GNOME and KDE. Vimzard took care of GNOME hacking, while KDE was taken care of by kstar, praaksh and yours truly. I had people create basic Qt applications; the other two took care of building KDE and hacking on KStars, a desktop planetarium application. Other organisations like FFmpeg and Sugar were also hacked on.

The next event, the OPC, was what I felt was one of the most challenging programming contests I have ever seen – tougher than Google Code Jam round 1 or the ACM ICPC, with questions harder than most problems on SPOJ, Project Euler, UVa or USACO. I teamed up with my friends Garimella and RajKishan. We worked out the algorithms for a couple of problems and actually solved one of them, the problem called power. At the end we stood at position 25. Sadly, again no consolation prizes.

opc

The next contest was Polyglot. We had to write a single-file program that would compile successfully under different compilers; there were a lot of languages we could use. We solved 7 out of 8, but unfortunately we had trailing whitespace in a few of our solutions, so only 3 were considered correct. We still won the 3rd prize (team rocket: Kashyap R Puranik and RajKishan G).

Here is a program I wrote that prints POLYGLOT in C++, Python, Perl, brainf**k and Ruby.

polyglot

Here is a program we wrote that reverses a string in 3 languages: Python, brainf**k and C++.

polyglot1

As I said, I couldn’t participate in Search 3.0, which was the most generous of the events, giving out 25K to 3 people.

Simulation Championship was a great contest where the coordinators had selected 16 teams for the finals: 6 teams for ant colony simulation, and 5 each for rescue bot and viral marketing. Rajoo and I decided to attempt both the ant simulation and the rescue bot problem statements.

We used Qt and C++ for both problem statements, but we did not have time to finish the rescue bot program. Our entry won the first prize for the ant simulation problem statement. Here are some screenshots.

simC

Here you can see the trails left behind by ants, and you can see some ants following the food source.

simC1

Here is a picture of the rescue bots UI we made.

simC2

Kalzium-GSoC-2

July 1, 2009

Hi everyone,

I am updating my blog again. This time I have added the molecular mass calculator. I have merged my work from the Kalzium branch back into trunk, and I have changed a few files in libkdeedu/libscience for this.
After attempting in vain to use the GNOME Chemistry Utils, I used the calculator already present in Kalzium and improved upon it.
Here are a few screenshots of the molecular mass calculator.

Screenshot

The calculator displays molecular mass and composition of the compound you input.

I have entered the formula CaSO4 (calcium sulfate) here and I get the molecular mass as 136, along with the percentages of the elements.

Screenshot-2

I have also used the idea of aliases, for example Me (methyl group) = CH3, Et (ethyl group) = C2H5 and so on.

In the following case, I have entered the compound toluene, which is Me-Ph and expands to (CH3)(C6H5).

molcal
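
The idea behind the calculator (alias expansion plus summing atomic masses) can be sketched in a few lines of Python. Kalzium itself is C++ and uses the full element data from libscience; the masses and aliases below are just a small illustrative subset:

import re

# small illustrative subset of atomic masses (g/mol)
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999, "S": 32.06, "Ca": 40.078}
# aliases expand to plain formulas, as described above
ALIASES = {"Me": "CH3", "Et": "C2H5", "Ph": "C6H5"}

def molecular_mass(formula):
    for alias, expansion in ALIASES.items():
        formula = formula.replace(alias, expansion)
    total = 0.0
    per_element = {}
    # an element symbol followed by an optional count, e.g. Ca, S, O4
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass = ATOMIC_MASS[symbol] * (int(count) if count else 1)
        per_element[symbol] = per_element.get(symbol, 0.0) + mass
        total += mass
    percentages = dict((s, round(100.0 * m / total, 2)) for s, m in per_element.items())
    return total, percentages

print molecular_mass("CaSO4")  # about 136, as in the screenshot above
print molecular_mass("Me-Ph")  # toluene, about 92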

We can also add our own aliases: use the aliases tab of the molecular mass calculator widget and add a valid alias.
The picture shows the default aliases that are available; I haven't added many, though.

Screenshot-1