Nov 14, 2012

Big Data ETL and Big Data Analysis

I was at Strata New York 2012 last month. Great conference! Thanks O'Reilly media for assembling the industry leaders and running it well.

I understand it was too crowded for some of my out-of-town friends. Stepping out to the streets of mid-town Manhattan for a breath of fresh air and calmness wasn't an option either. Maybe O'Reilly can get a bigger space next year?

My primary interest in Big Data analysis was structured data analysis i.e. crunching, munging (ETL) and analysis of large dataset in columns and rows.

My team deals with 1-2 Terabytes (~ 1 billion rows) of structured data (e.g sales transaction data) regularly for marketing/retail/healthcare analytics. Like others, we're spending a lot of time in Big Data ETL processes and less on Big Data Analysis. Someone at Strata New York captured this well,
80% of a Big Data development effort goes into data integration efforts and only 20% of our effort/time is spent on analysis, i.e. interesting things we want to work on
I want to flip this equation. I want to be able to spend 20% of our time/effort on Big Data ETL/integration efforts and 80% on Big Data analysis.

At Strata, I wanted to check if vendors and open source communities had simplified the Hadoop stack for my use. There have been many improvements in the past year and the products on my list is quite long. We've more players in Big Data space and the solution space is muddy. It is great to have more vendors, experts, communities focusing on Big Data but the product space is crowded, fragmented and CONFUSING (just listing all Apache products discussed at Strata needs 1 full page.)

I created a list of products to try out. I wish these products were easy to evaluate (legal paperwork, infrastructure footprint, and ease of setup and execute my use cases).

As far as ease of use and powerful data science workbench is concerned, I want to use something like R or even Excel for these steps (Big Data ETL and Big Data analysis), but they are both memory constrained. So, I need other options.

Did someone say, Hadoop? Yup, on it. I tried it a few years ago and we're exploring it now. Hadoop/MapReduce is THE infrastructure to power Big Data ETL and Big Data analysis.

I also believe that MapReduce and Databases are complementary technologies and the experts agree! See MapReduce and Parallel DBMSs: friends or foes?

Here's how I've framed my problem and thinking about the solution space.

Big Data ETL process

Take big structured dataset (multiple CSV files with total 100M-1B rows) and create DDL/clean/transform/split/sample/separate errors in minutes.

Solution options:
  1. Unix scripts (shell, awk, perl). Do all of the above in one pass (i.e. read each row only once) quickly. Start with Unix parallel processes, scale to multiple machines (mapreduce-style) only if needed
  2. Big Data ETL tools like Kafka?
  3. Open source ETL tools (e.g. Talend)
  4. Can commercial ETL tools do this in a few minutes/hours?
  5. Others?
Given any structured data from client (csv), our Big Data ETL workbench takes the data and processes it super fast (detect data types, clean, transform eg. change date formats to our internal standards, separate error rows, create sample, split into multiple clean files)

Raw data files have different schema that we auto-detect in processing (only string vs numeric types to begin with).

Big Data Analysis

Then we load this data for analysis:
  1. RDBMS for well-defined arithmetic/set-based analysis
  2. noSQL database (Lucene/Solr with Blacklight front-end for discovery). Blacklight project: Open source discovery app built on Lucene/Solr. Thinking of it as a discovery app for Big Data analysis. Facets on top of structured data. Slide-dice large structured dataset. We can add visualizations later (e.g. summaries etc.) Checkout this Stata session on Lucene-powered Big Data analysis which confirmed this design hypothesis
  3. The clean split files can be used in any stats tool as well for statistical analysis e.g. SAS for larger data sets, R for smaller ones (often the clean split files are small enough for R)
  4. New Big Data tools like Impala

We're building proof-of-concepts for Big Data ETL (Unix scripts) and Blacklight discovery app on top of Lucene/Solr. I will share it when its ready. Stay tuned.


  1. Nice post. You should try Python. There are number of libraries there to deal with big data.

  2. Good overview as a starting point.
    My first idea was also Python. it definitely beats shell etc. in terms of coding comfort and lib support. and it is more lightweight and adaptable than the ETL tools. however I'm not sure about its parallelization capabilities, such as load balancing over various processes, cores, and machines.
    => loads of space to explore

  3. We've written AWK program to automate the detection of schema, splitting, sampling, cleaning, transforming files with good success. We are running it on MAWK with GNU Parallel and the throughput is 1 minute to process 1 million rows on modern hardware (extrapolating it means that it will take us ~1000 minutes to process 1 billion rows).

    We brought the time down 3 - 5 times by using multiple machines in GNU Parallel setup (GNU Parallel is awesome). Next step is to run it on a Hadoop cluster and bring down the time by an order of magnitude (e.g. process 1TB/1Billion rows in less than 60 minutes).

    The "discovery" app using Blacklight and Solr is also progressing well and showing positive signs. We're continuing to explore it.
    Discovery app is Kayak-style faceted browser for structured data.

  4. Then again, Although training is necessary in many places up to a specific age, participation at school regularly isn't, and a minority of folks pick self-teaching, e-learning or comparable for their kids.

  5. Nice post and analysis info give us great advice thanks for share it paraphrasing machine .

  6. The expansion of internet and intelligence in business process lead the way to huge volume of data. It is important to maintain and process these data to be efficient in data handling. Hadoop Training in Chennai | Big Data Training in Chennai

  7. The strategy you have posted on this technology hepled me to get into the next level and had lot of informations in it.
    salesforce training in chennai | salesforce training institute in chennai

  8. Survey analysis is often presumed to be difficult - the reality may not be so. Survey analysis in a survey research can be classified into two - quantitative and qualitative data analysis. This article tries to throw light on the basic differences between the two techniques. See more data analysis in qualitative research