The purpose of this assignment is for you to solve a problem of your choosing using Hadoop. You may use your choice of the Google-IBM cluster, Amazon's AWS, your own computer, or a machine in the WeH 5419C/D cluster. Access details for the clusters will be provided in the recitation after the proposals are due.
The rest of the lab is once again partnered.
First Step: Research the Problem
Examine the small chunk of the Wiki dump available in the handout directory. Search Wikipedia and Google for problems that other people have tried to solve with the Wikipedia data. Look for the types of questions that fit the map-reduce paradigm. What we want turned in is a list of URLs with short summaries of the problems you found. You may also include any ideas for your proposal that you have come up with while researching. This part is due Sunday, April 5. Please dump a .txt file in the handin directory.
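As a rough illustration of what "fits the map-reduce paradigm" means, here is a minimal sketch in plain Python (not Hadoop itself) of one such question: how many revisions does each page have? The record layout, the function names, and the sample data are all made up for illustration. A question tends to fit when each input record can be processed on its own (map) and the results can then be combined per key (reduce).

```python
from collections import defaultdict


def map_revision(record):
    """Map step: emit (page_title, 1) for one revision record."""
    title, _editor = record  # hypothetical record layout: (title, editor)
    yield title, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in-process over an iterable of records."""
    pairs = (pair for record in records for pair in map_revision(record))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data: one (page title, editor) tuple per revision.
SAMPLE_REVISIONS = [
    ("Hadoop", "alice"),
    ("MapReduce", "bob"),
    ("Hadoop", "carol"),
]

if __name__ == "__main__":
    print(run_job(SAMPLE_REVISIONS))
```

If you cannot phrase your question as independent per-record work followed by a per-key combination (possibly chained over several rounds), it is probably a poor fit.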
Second Step: Pick a Problem
The following Tuesday (April 7th), the actual proposal is due. We cannot stress enough how important a well-thought-out plan is to this lab. A good question will be easy to answer; a poorly designed question may well be impossible to answer. Here we want a few paragraphs describing your plan in detail: one paragraph describing the problem you want to solve, followed by a few detailing your solution. We are trying to make the entire dump of the wiki available. This will allow you to search through the entire edit history of Wikipedia, not just a dump of the current Wikipedia pages.
Again, please place a .txt file in the handin directory that describes exactly what you are attempting to do. Please be as detailed as possible, since we can only work from what you give us.
After you turn it in on Tuesday, we will provide you with lots of feedback in recitation on Wednesday. If we like your proposal, you will be given the go-ahead to start. We will also hand out login information in recitation, so please show up.
Further Dates
Tuesday, April 14th is Checkpoint 1. What we want here will be determined by your proposal, but it will be something on the order of getting the data parsed and getting everything working with the Google and/or AWS cluster.
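To give a sense of what "getting the data parsed" might involve, here is a hedged Python sketch that streams (title, timestamp) pairs out of a dump. It assumes the dump follows the usual MediaWiki XML export layout (page, title, revision, timestamp elements); the chunk in the handout directory may differ, so check it first. The function names and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Stream (page_title, timestamp) pairs from a MediaWiki-style XML dump.

    Assumes <title> appears before the <revision> elements of its <page>.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            timestamp = None
            for child in elem:
                if local_name(child.tag) == "timestamp":
                    timestamp = child.text
            yield title, timestamp
            elem.clear()  # drop the revision text to keep memory use low
        elif name == "page":
            title = None
            elem.clear()


# Hypothetical sample, shaped like a tiny MediaWiki export.
SAMPLE_XML = b"""<mediawiki>
  <page>
    <title>Hadoop</title>
    <revision><timestamp>2009-03-01T00:00:00Z</timestamp><text>first</text></revision>
    <revision><timestamp>2009-03-02T00:00:00Z</timestamp><text>second</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for title, timestamp in iter_revisions(io.BytesIO(SAMPLE_XML)):
        print(title, timestamp)
```

Streaming with iterparse rather than loading the whole file matters here because a full edit-history dump is far too large to hold in memory.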
About the Wiki Dump
You might find the following resources helpful:
We're Here To Help!
As always -- remember, we're here to help!