Apr 19, 2010

Calling MapReduce/Hadoop experts

After experimenting with MapReduce (Hadoop, Pig) last year, we recently ran some tests to check if it's worth pursuing further for large data analytics.

Test details:
- Environment: We ran these tests on Amazon's cloud (quick, cheap, no hassles :-)
- Test data: 500 million and 1 billion rows of simple observations (two-column data: Customer_ID and Amount)
- Computation: Simple (group_by and summation)
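
To make the computation concrete, here is a minimal single-machine sketch of the group_by-and-sum job written in map/shuffle/reduce style. It is not the Hadoop or Pig job we actually ran; the function names, the in-memory partitions, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(rows):
    """Emit one (customer_id, amount) pair per input row."""
    for customer_id, amount in rows:
        yield customer_id, amount


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the amounts collected for each customer."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_customer(partitions):
    """Run map over every partition, then shuffle and reduce the combined output."""
    pairs = chain.from_iterable(map_phase(part) for part in partitions)
    return reduce_phase(shuffle(pairs))


# Hypothetical sample data: two partitions standing in for input splits.
partitions = [
    [("C1", 10.0), ("C2", 5.0)],
    [("C1", 2.5), ("C3", 7.0)],
]
totals = total_by_customer(partitions)
print(totals)
```

On a real cluster each partition would be mapped on a different node and the shuffle would move data over the network, which is where much of the non-parallel overhead comes from.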

Here's what we found:
1) Hadoop scales well: Even when we doubled the data volume, processing time did not increase proportionately (notice the vertical distance between the curves)
2) MapReduce gains flatten out after a certain point: Beyond a certain number of nodes, there are no more savings in computation time (notice the flattening of the curves). Scaling up infinitely won't drop computation time to seconds :-|
3) Pay more to save more time: Processing time is a function of the number of nodes. We can easily decide how much we're willing to pay based on the time savings (RoI). For example, a high-priced expert might choose to pay more to save more time, compared to what an analyst chooses.

Now that Hadoop has passed the "sniff" test, we plan to run a real-life computation on it. I'm also looking for experts to drive this forward now. If there's anyone interested, please leave a comment.

We need to keep Amdahl's law in mind to estimate the max. savings expected from parallelization.
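
As a rough illustration of why the curves flatten, Amdahl's law gives the speedup on n nodes as 1 / ((1 - p) + p / n), where p is the fraction of the job that can run in parallel. The sketch below just evaluates that formula; the value p = 0.9 is an assumption chosen for the example, not something we measured in these tests.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Speedup predicted by Amdahl's law for a given parallel fraction and node count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


def max_speedup(parallel_fraction):
    """Upper bound on speedup as the node count grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction


# Assumed parallel fraction, for illustration only.
p = 0.9
for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:>3} nodes: speedup {amdahl_speedup(p, n):.2f}x")
print(f"limit: {max_speedup(p):.2f}x")
```

With 10% of the work stuck in serial steps, the speedup can never exceed roughly 10x no matter how many nodes we pay for, which matches the flattening described in finding 2.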

Note: I'm also interested in running R on MapReduce sometime in the future.


  1. Yahoo has done extensive work on Hadoop. Maybe you should check them below


