Apr 19, 2010

Calling MapReduce/Hadoop experts

After experimenting with MapReduce (Hadoop, Pig) last year, we recently ran some tests to check if it's worth pursuing further for large data analytics.

Test details:
- Environment: We ran these tests on Amazon's cloud (quick, cheap, no hassles :-)
- Test data: 500 million and 1 billion rows of simple observations (two columns: Customer_ID and Amount)
- Computation: Simple (group_by and summation)
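
For readers new to MapReduce, the group-by-and-sum computation can be sketched in plain Python as below. This is only an illustration of the map, shuffle and reduce steps on a handful of made-up rows, not the Hadoop or Pig job we actually ran; the function names and the sample data are assumptions.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (customer_id, amount) key-value pair per input row."""
    for customer_id, amount in rows:
        yield customer_id, amount


def shuffle(pairs):
    """Collect all values under their key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the amounts collected for each customer."""
    return {key: sum(values) for key, values in grouped.items()}


def total_by_customer(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


# Made-up sample rows (Customer_ID, Amount), for illustration only.
sample_rows = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5), ("c3", 1.0), ("c2", 4.0)]
totals = total_by_customer(sample_rows)
print(totals)
```

On a real cluster the map and reduce steps run on different nodes and the shuffle moves data across the network, which is one reason adding nodes does not cut the run time indefinitely.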

Here's what we found:
1) Hadoop scales well: Even when we doubled the data volume, processing time did not increase proportionately (notice the vertical distance between the curves).
2) MapReduce gains flatten out after a certain point: Beyond a certain number of nodes, there are no more savings in computation time (notice the flattening of the curves). Scaling up indefinitely won't drop computation time to seconds :-|
3) Pay more to save more time: Processing time is a function of the number of nodes, so we can easily decide how much we're willing to pay based on the time savings (RoI). For example, a high-priced expert might choose to pay more to save more time than an analyst would.

Now that Hadoop has passed the "sniff" test, we plan to run a real-life computation on it. I'm also looking for experts to drive this forward now. If there's anyone interested, please leave a comment.

We need to keep Amdahl's law in mind when estimating the maximum savings to expect from parallelization.
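
As a rough sketch of what Amdahl's law implies: if a fraction p of the job can be parallelized, the speedup on n nodes is 1 / ((1 - p) + p / n), and it can never exceed 1 / (1 - p). The 0.95 parallel fraction used below is a made-up value for illustration, not something measured in our tests.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's law for a given parallel fraction and node count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


def max_speedup(parallel_fraction):
    """Upper bound on the speedup as the number of nodes grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


# Hypothetical parallel fraction, chosen only to show the shape of the curve.
assumed_parallel_fraction = 0.95
for node_count in (1, 2, 4, 8, 16, 32, 64):
    speedup = amdahl_speedup(assumed_parallel_fraction, node_count)
    print(f"{node_count:>3} nodes: {speedup:.2f}x")
print(f"limit: {max_speedup(assumed_parallel_fraction):.2f}x")
```

The speedup curve flattens as nodes are added, which matches the shape we saw in the tests: the serial part of the job sets a floor on the run time.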

Note: I'm also interested in running R on MapReduce sometime in the future.


  1. Yahoo has done extensive work on Hadoop. Maybe you should check out their work.


