Apr 22, 2011

Big data problems

I have big data problems.

I need to analyze hundreds of millions of rows of data, and for the past two weeks I have been trying hard to see whether R can handle it. My assessment so far, based on these experiments:

1) R is best for data that fits a computer's RAM (so get more RAM if you can).

2) R can be used for datasets that don't fit into RAM via the bigmemory and ff packages. However, this technique only works well for datasets smaller than about 15 GB. This is in line with the excellent analysis done by Ryan, and there is another good tutorial on bigmemory as well.

3) If we need to analyze datasets larger than 15 GB, then SAS, MapReduce and RDBMSs :( seem like the only options, as they store data on the file system and access it only as needed.

Since MapReduce implementations are still clumsy and not yet business-friendly, I wonder if it's time to explore commercial analytics tools like SAS for big-data analytics.

Can Stata, Matlab or Revolution R analyse datasets in the 50-100 GB range effectively?



  1. Do you really need that huge number of rows in memory? Perhaps chunking and the use of list/hash elements would succeed.

    My current best friends for data preparation with GBs of data: awk, mawk, and Python.
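    The chunking idea suggested above can be sketched in Python, one of the tools just mentioned. This is a minimal illustration, not the commenter's actual workflow; the file contents and column position are made up for the demo:

```python
import csv
import io

def streaming_mean(lines, col=1):
    # Accumulate only a running sum and count, so memory use stays
    # constant no matter how many rows the source holds.
    total, count = 0.0, 0
    for row in csv.reader(lines):
        total += float(row[col])
        count += 1
    return total / count

# Toy stand-in for a multi-GB file; a real run would pass an open
# file handle instead of a StringIO.
demo = io.StringIO("a,1\nb,2\nc,3\n")
print(streaming_mean(demo))  # 2.0
```

    The same shape of loop works for counts, sums, and group-wise aggregates; only per-group accumulators, never the rows themselves, live in RAM.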

  2. Have you tried Python? It is also an open-source language and easy for beginners. More importantly, you can access R from Python almost seamlessly with the RPy package. I ran into the same problem as you (although with less data than the 100 GB you are facing) and solved it with Python. I also wrote a post on it: http://www.mathfinance.cn/life-is-short-use-python/.

  3. Have a look at ROOT (root.cern.ch). It was created for particle physics data, and we routinely analyze ntuples with > 1B events. You can have the data split in multiple files but then merge it for analysis. It's got limitations too, but might be helpful.

  4. SAS is good for large datasets, as it has out-of-core algorithms. S-PLUS can also do this, and Revolution Computing's version of R does as well. All of these are commercial products. In the open-source domain I have found Python to be great. I would also look at open-source databases such as MySQL and SQLite, though I haven't used these myself.
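    As a hedged illustration of the database route mentioned above, Python's built-in sqlite3 module lets the database engine do the full scan so only the small aggregated result ever reaches your program. The table and column names here are invented for the demo, and ":memory:" stands in for what would be an on-disk database file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real dataset would use a file path
conn.execute("CREATE TABLE calls (user_id INTEGER, duration REAL)")
conn.executemany("INSERT INTO calls VALUES (?, ?)",
                 [(1, 10.0), (1, 20.0), (2, 5.0)])

# The engine scans the table; Python only ever sees the tiny
# aggregated result set, not the raw rows.
rows = conn.execute(
    "SELECT user_id, SUM(duration) FROM calls "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 30.0), (2, 5.0)]
```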

  5. You say SAS and MapReduce, but you can also use R with MapReduce, in case you (or your readers) didn't know.

    Check out RHIPE, for starters:
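    RHIPE itself runs R code on a Hadoop cluster, but the map/shuffle/reduce model it exposes can be sketched in a few lines of plain Python. All names here are illustrative, and a real MapReduce run distributes the three phases across machines rather than a single loop:

```python
from collections import defaultdict

def map_step(record):
    user, duration = record
    yield user, duration             # emit (key, value) pairs

def reduce_step(key, values):
    return key, sum(values)          # combine all values for one key

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:           # map phase
        for key, value in map_step(record):
            groups[key].append(value)    # shuffle: group values by key
    return dict(reduce_step(k, v) for k, v in groups.items())

result = run_mapreduce([(1, 10.0), (1, 20.0), (2, 5.0)])
print(result)  # {1: 30.0, 2: 5.0}
```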


  6. Stata is also limited by the RAM on your computer, so it wouldn't help in this instance.

  7. I regularly analyse > 15 GB data sets using the standard R distribution (and am the author of the second article you reference). You do have to think and work somewhat differently from how the standard introductions to the language work, which is obviously a problem. And of course it does depend on what you need to do: I did have problems around 100 million call records when I tried to do social network analysis the naive way [1], but I eventually found a more fruitful way of analysing that data set.

    Standard recommendations include the biglm, biganalytics, speedglm, and biglars packages, as well as DBI and friends.

    In general, and this is probably better suited for a blog post than a comment, my approach is first to work hard at data selection and preparation, to make sure I am working on the right problem, and then to look for algorithms that I can execute in chunks and then combine. The latter is of course essentially what SAS does too.


    [1] http://www.cybaea.net/Blogs/Data/SNA-with-R-Loading-large-networks-using-the-igraph-library.html
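    The chunk-then-combine strategy described above can be sketched in Python for a simple case: compute sufficient statistics for each chunk, then merge them, so only one chunk is ever in memory. Mean and variance here stand in for whatever biglm-style model fitting is actually needed, and the chunks are toy stand-ins for on-disk pieces of a large file:

```python
def chunk_stats(chunk):
    # Sufficient statistics for mean/variance: count, sum, sum of squares.
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(stats):
    # Merging the per-chunk statistics gives the exact whole-data answer.
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    sq = sum(s[2] for s in stats)
    mean = total / n
    variance = sq / n - mean ** 2    # population variance
    return mean, variance

chunks = [[1.0, 2.0], [3.0, 4.0, 5.0]]        # stand-ins for on-disk chunks
per_chunk = [chunk_stats(c) for c in chunks]  # pass 1: one chunk in RAM at a time
print(combine(per_chunk))  # (3.0, 2.0)
```

    Regression via biglm works the same way in spirit: each chunk updates a small set of sufficient statistics, so the model fits in RAM even when the data does not.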

  8. Rick from SAS here. I think that the 2009 ASA Data Expo (http://stat-computing.org/dataexpo/2009/posters/) really helped expose many statistical programmers to the magnitude of data that corporations have to analyze every day. Taking part in the Expo was definitely an eye-opening experience for me, and it was fun to use SAS to analyze such a massive data set. For a summary, see http://support.sas.com/publishing/authors/extras/Wicklin_scgn-20-2.pdf

    In the open source world, Kane and Emerson's bigmemory package (http://www.bigmemory.org/) is a great addition to the R arsenal. For his work on bigmemory, Kane was awarded the 2010 Chambers Award by the ASA Sections on Statistical Computing and Statistical Graphics.

  9. Thank you everyone for the suggestions. I'm investigating a few solutions and will report back my findings. cheers

  10. Revolution Analytics develops an enterprise version of R that is specifically geared toward working with big data and parallel processing. You might want to give them a look.


  11. I'm looking at Revolution R as well, and I like what I've seen so far. Their desktop product is easy to get up and running with, and it is really fast for analyzing hundreds of millions of records.

    I'm also playing with the ff package and will report back my findings in a few weeks.

  12. Since memory is so cheap now, just install more on your machine so that you don't need a virtual-memory solution for Stata or R. I regularly work with a dataset that is >30 GB in Stata on a desktop computer with 32 GB of RAM installed. As long as your RAM is larger than the dataset, it will run fine.

  13. Think about this: for the cost of an Intel 160 GB SSD (320 or 510 series), one gets near-RAM speeds from "disk files". I would expect that someone (not I, as my C is very old and clunky) will in time build a package that leverages SSDs. As for just loading up on memory, most PCs these days already ship with all DIMM slots filled with the largest supported DIMMs.

