After experimenting with MapReduce (Hadoop, Pig) last year, we recently ran some tests to check if its worth pursuing further for large data analytics.
- Environment: We ran these tests on Amazon's cloud (quick, cheap, no hassles :-)
- Test data: 500 Million and 1 Billion rows of simple observations (2 column data - Customer_ID and Amount)
- Computation: Simple (group_by and summation)
Here's what we found:
1) Hadoop scales well: Even when we doubled the data volume, processing time did not increase proportionately (notice the vertical distance between the curves)
2) MapReduce gains flatten out after a certain point: Beyond a certain # nodes, there are no more savings in computation time (notice the flattening of the curves). Scaling up infinitely won't make drop computation time to seconds :-|
3) Pay more to save more time: Processing time is a factor of # nodes. We can easily decide how much we're willing to pay based on the time savings (RoI) example, a high priced expert might chose to pay more to save more time, compared to what an analyst chooses.
Now that Hadoop has passed the "sniff" test, we plan to run a real-life computation on it. I'm also looking for experts to drive this forward now. If there's anyone interested, please leave a comment.
We need to keep Amdahl's law in mind to estimate the max. savings expected from parallelization.
Note: I'm also interested in running R on Mapreduce sometime in the future