Nov 14, 2012

Big Data ETL and Big Data Analysis

I was at Strata New York 2012 last month. Great conference! Thanks O'Reilly media for assembling the industry leaders and running it well.

I understand it was too crowded for some of my out-of-town friends. Stepping out to the streets of mid-town Manhattan for a breath of fresh air and calmness wasn't an option either. Maybe O'Reilly can get a bigger space next year?

My primary interest in Big Data analysis was structured data analysis i.e. crunching, munging (ETL) and analysis of large dataset in columns and rows.

My team deals with 1-2 Terabytes (~ 1 billion rows) of structured data (e.g sales transaction data) regularly for marketing/retail/healthcare analytics. Like others, we're spending a lot of time in Big Data ETL processes and less on Big Data Analysis. Someone at Strata New York captured this well,
80% of a Big Data development effort goes into data integration efforts and only 20% of our effort/time is spent on analysis, i.e. interesting things we want to work on
I want to flip this equation. I want to be able to spend 20% of our time/effort on Big Data ETL/integration efforts and 80% on Big Data analysis.

At Strata, I wanted to check if vendors and open source communities had simplified the Hadoop stack for my use. There have been many improvements in the past year and the products on my list is quite long. We've more players in Big Data space and the solution space is muddy. It is great to have more vendors, experts, communities focusing on Big Data but the product space is crowded, fragmented and CONFUSING (just listing all Apache products discussed at Strata needs 1 full page.)

I created a list of products to try out. I wish these products were easy to evaluate (legal paperwork, infrastructure footprint, and ease of setup and execute my use cases).

As far as ease of use and powerful data science workbench is concerned, I want to use something like R or even Excel for these steps (Big Data ETL and Big Data analysis), but they are both memory constrained. So, I need other options.

Did someone say, Hadoop? Yup, on it. I tried it a few years ago and we're exploring it now. Hadoop/MapReduce is THE infrastructure to power Big Data ETL and Big Data analysis.

I also believe that MapReduce and Databases are complementary technologies and the experts agree! See MapReduce and Parallel DBMSs: friends or foes?

Here's how I've framed my problem and thinking about the solution space.

Big Data ETL process

Take big structured dataset (multiple CSV files with total 100M-1B rows) and create DDL/clean/transform/split/sample/separate errors in minutes.

Solution options:
  1. Unix scripts (shell, awk, perl). Do all of the above in one pass (i.e. read each row only once) quickly. Start with Unix parallel processes, scale to multiple machines (mapreduce-style) only if needed
  2. Big Data ETL tools like Kafka?
  3. Open source ETL tools (e.g. Talend)
  4. Can commercial ETL tools do this in a few minutes/hours?
  5. Others?
Given any structured data from client (csv), our Big Data ETL workbench takes the data and processes it super fast (detect data types, clean, transform eg. change date formats to our internal standards, separate error rows, create sample, split into multiple clean files)

Raw data files have different schema that we auto-detect in processing (only string vs numeric types to begin with).

Big Data Analysis

Then we load this data for analysis:
  1. RDBMS for well-defined arithmetic/set-based analysis
  2. noSQL database (Lucene/Solr with Blacklight front-end for discovery). Blacklight project: Open source discovery app built on Lucene/Solr. Thinking of it as a discovery app for Big Data analysis. Facets on top of structured data. Slide-dice large structured dataset. We can add visualizations later (e.g. summaries etc.) Checkout this Stata session on Lucene-powered Big Data analysis which confirmed this design hypothesis
  3. The clean split files can be used in any stats tool as well for statistical analysis e.g. SAS for larger data sets, R for smaller ones (often the clean split files are small enough for R)
  4. New Big Data tools like Impala

We're building proof-of-concepts for Big Data ETL (Unix scripts) and Blacklight discovery app on top of Lucene/Solr. I will share it when its ready. Stay tuned.


  1. Nice post. You should try Python. There are number of libraries there to deal with big data.

  2. Good overview as a starting point.
    My first idea was also Python. it definitely beats shell etc. in terms of coding comfort and lib support. and it is more lightweight and adaptable than the ETL tools. however I'm not sure about its parallelization capabilities, such as load balancing over various processes, cores, and machines.
    => loads of space to explore

  3. We've written AWK program to automate the detection of schema, splitting, sampling, cleaning, transforming files with good success. We are running it on MAWK with GNU Parallel and the throughput is 1 minute to process 1 million rows on modern hardware (extrapolating it means that it will take us ~1000 minutes to process 1 billion rows).

    We brought the time down 3 - 5 times by using multiple machines in GNU Parallel setup (GNU Parallel is awesome). Next step is to run it on a Hadoop cluster and bring down the time by an order of magnitude (e.g. process 1TB/1Billion rows in less than 60 minutes).

    The "discovery" app using Blacklight and Solr is also progressing well and showing positive signs. We're continuing to explore it.
    Discovery app is Kayak-style faceted browser for structured data.

  4. Then again, Although training is necessary in many places up to a specific age, participation at school regularly isn't, and a minority of folks pick self-teaching, e-learning or comparable for their kids.

  5. Nice post and analysis info give us great advice thanks for share it paraphrasing machine .

  6. The expansion of internet and intelligence in business process lead the way to huge volume of data. It is important to maintain and process these data to be efficient in data handling. Hadoop Training in Chennai | Big Data Training in Chennai

  7. The strategy you have posted on this technology hepled me to get into the next level and had lot of informations in it.
    salesforce training in chennai | salesforce training institute in chennai

  8. Survey analysis is often presumed to be difficult - the reality may not be so. Survey analysis in a survey research can be classified into two - quantitative and qualitative data analysis. This article tries to throw light on the basic differences between the two techniques. See more data analysis in qualitative research

  9. Web designing is the process that involves collecting ideas and implementing them with the intent of creating the web pages for a website.
    Web Designing Course in Chennai | web designing training in chennai

  10. The strategy you have posted on this technology hepled me to get into the next level and had lot of informations in it. Python is one of the basic level programming and is very important one.
    Python Training in Chennai | Python Course in Chennai

  11. We always think about the data, but we never know how to use it properly at the certain moment. Oupfff.. After such information, I feel my brain have to take a rest. I need my amazing football. Actually, even the football makes to think about the data. For example, we analyze the game to understand who’ll be the winner! Go with me to that site and make sure!

  12. This comment has been removed by the author.

  13. Ethical hacking describes hacking performed by a company or individual to help them to identify potential threats on a computer or network.
    Ethical hacking Course in Chennai | Ethical hacking Training in Chennai


  14. Thanks for your marvelous posting! I quite enjoyed reading it, you happen to be a great author. I will remember to bookmark your blog and will eventually come back very soon. Go to best social plan for get more related topic. Have a nice evening!
    Situs BandarQ, Poker, Domino 99 Online Terpercaya - Situs Judi Bola Online Terbesar dan Terpercaya Indonesia

  15. You have done really great job. Your blog is very unique and informative. Thanks. Devops Online Training | Data Science Online Training

  16. Hi, Really your post was very informative. Today's internet era learn Hadoop Online Training will helps you to reach your goal.Selenium Training

  17. Nice sharing. R is a language and environment for statistical computing and graphics. Want to make a career in R Programming. Learn R Programming Training course @ GangBoard. We are the best provider of online training on evergreen technologies.

  18. Thank you for this valuable information. We are the best erp software solutions in chennai. Contact us on software providers in Chennai

  19. Excellent post! I heve read your blog it's very interesting and informative. Keep sharing.
    erp providers in chennai | erp software solutions in chennai

  20. Do you have a spam problem on this website; I also am a blogger, and I was wanting to know your situation; many of us
    have developed some nice practices and we are looking to trade methods
    with other folks, why not shoot me an email if interested.

  21. Thanks with regard to giving this type of terrific info

  22. Its like you learn my thoughts! You appear to understand a
    lot approximately this, like you wrote the book
    in it or something. I think that you just can do with a few percent to drive the message house a bit, however
    instead of that, that is excellent blog. A great read.
    I will certainly be back.

  23. Hello! This is my first visit to your blog! We are a group of volunteers
    and starting a new initiative in a community in the same niche.
    Your blog provided us beneficial information to
    work on. You have done a marvellous job!

  24. This website was… how do I say it? Relevant!!
    Finally I’ve found something which helped me.
    Thank you!

  25. Hello there, I found your web site by way of Google even as looking for a comparable
    subject, your website came up, it seems to
    be good. I’ve bookmarked it in my google bookmarks.
    Hello there, simply changed into aware of your weblog thru Google,
    and found that it’s really informative. I am gonna watch out for brussels.
    I’ll be grateful in the event you continue this in future.
    Many people shall be benefited from your writing. Cheers!

  26. I was recommended this website by my cousin. I am not sure whether this post is written by him as no one else know such detailed about my trouble.
    You’re wonderful! Thanks!

  27. Digital massage therapy chairs assist to boost
    the body’s blood stream flow, which possesses a few benefits.