Jan 23, 2015

Microsoft to acquire Revolution Analytics: Now this gets interesting...

I'm excited about Microsoft acquiring Revolution Analytics. Microsoft has built business-friendly tools for decades, and this is a great opportunity for them to bring the power of R to everyday users. Waiting to hear more from Microsoft...

Dec 30, 2013

King Kallis - the greatest all-rounder in world cricket

A hero's farewell
Jacques Kallis is retiring today (Dec 30, 2013) from Test cricket at the age of 38. He will continue to play One Day International and T20 cricket.

Hats off to a magnificent all-rounder and a wonderful athlete. The way he carried himself on and off the cricket field is remarkable.

Kallis is a true champion and a great role model for future generations. Thank you, Kallis, for inspiring us for two decades. Today, as I watch South Africa beat India in the Durban Test, I cannot think of a better send-off for Kallis.

1. South Africa beat India convincingly
2. Kallis scored a century in this test
3. He completed 200 catches in test cricket in this game

As he retires, I did some analysis comparing Kallis with other greats of the game, which I've shared in this post. The analysis was done using R, Shiny and Ruby.

Best all-rounders in Test cricket history
Over the years, Kallis gave his team a critical advantage and a wonderful balance as an all-rounder. South Africa could pick an extra batsman or bowler depending on the opposition and the conditions.

Kallis and Sobers stand out for all-round performance in Test cricket history

In Test cricket, Kallis = Rahul Dravid + Zaheer Khan

Kallis has performed as well as a successful specialist Indian batsman and a successful specialist Indian bowler combined in Test cricket.

Jacques Kallis = Rahul Dravid + Zaheer Khan

Test cricket batting

| Player | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 4s | 6s | Ct | St |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 165 | 279 | 40 | 13174 | 224 | 55.12 | 28587 | 46.08 | 44 | 58 | 1475 | 97 | 200 | 0 |
| Rahul Dravid | 164 | 286 | 32 | 13288 | 270 | 52.31 | 31258 | 42.52 | 36 | 63 | 1654 | 21 | 210 | 0 |

Kallis = Dravid in Test Cricket (Batting)
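The batting comparison can be verified from the table itself: a batting average is runs divided by dismissals (innings minus not-outs), and a strike rate is runs per 100 balls faced. A quick sketch using the figures from the Test batting table above:

```python
# Batting average = Runs / (Inns - NO); strike rate = 100 * Runs / BF
def batting_average(runs, innings, not_outs):
    return runs / (innings - not_outs)

def strike_rate(runs, balls_faced):
    return 100.0 * runs / balls_faced

# Figures from the Test batting table above
kallis_ave = batting_average(13174, 279, 40)  # ~55.12
dravid_ave = batting_average(13288, 286, 32)  # ~52.31
print(round(kallis_ave, 2), round(dravid_ave, 2), round(strike_rate(13174, 28587), 2))
```

The two averages land within three runs of each other, which is the basis of the "Kallis = Dravid" claim.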

Test cricket bowling

| Player | Mat | Inns | Balls | Runs | Wkts | BBI | BBM | Ave | Econ | SR | 4w | 5w | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 165 | 271 | 20166 | 9499 | 292 | 6/54 | 9/92 | 32.53 | 2.82 | 69.0 | 7 | 5 | 0 |
| Zaheer Khan | 89 | 160 | 17975 | 9768 | 300 | 7/87 | 10/149 | 32.56 | 3.26 | 59.9 | 15 | 10 | 1 |

Kallis ~ Zaheer Khan in Test Cricket (Bowling)
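The bowling figures follow the same pattern: bowling average is runs conceded per wicket, economy is runs per over (six balls), and strike rate is balls per wicket. A sketch with the figures from the Test bowling table above:

```python
# Bowling average = Runs / Wkts; economy = 6 * Runs / Balls; SR = Balls / Wkts
def bowling_average(runs, wickets):
    return runs / wickets

def economy_rate(runs, balls):
    return 6.0 * runs / balls

def bowling_strike_rate(balls, wickets):
    return balls / wickets

# Figures from the Test bowling table above
print(round(bowling_average(9499, 292), 2))   # Kallis: ~32.53
print(round(bowling_average(9768, 300), 2))   # Zaheer: ~32.56
```

The two averages differ by 0.03 runs per wicket, hence "Kallis ~ Zaheer Khan".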

In One Day International cricket, Kallis = Sourav Ganguly + Abdul Razzaq

Kallis has performed as well as a successful specialist Indian batsman and a successful specialist Pakistani bowler combined in One Day cricket.

Jacques Kallis = Sourav Ganguly + Abdul Razzaq

ODI cricket batting

| Player | Mat | Inns | NO | Runs | HS | Ave | BF | SR | 100 | 50 | 4s | 6s | Ct | St |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 325 | 311 | 53 | 11574 | 139 | 44.86 | 15866 | 72.94 | 17 | 86 | 911 | 137 | 129 | 0 |
| Sourav Ganguly | 311 | 300 | 23 | 11363 | 183 | 41.02 | 15416 | 73.70 | 22 | 72 | 1122 | 190 | 100 | 0 |

Kallis = Sourav Ganguly in One Day Cricket (Batting)

ODI cricket bowling

| Player | Mat | Inns | Balls | Runs | Wkts | BBI | BBM | Ave | Econ | SR | 4w | 5w | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jacques Kallis | 325 | 283 | 10750 | 8680 | 273 | 5/30 | 5/30 | 31.79 | 4.84 | 39.3 | 2 | 2 | 0 |
| Abdul Razzaq | 265 | 254 | 10941 | 8564 | 269 | 6/35 | 6/35 | 31.83 | 4.69 | 40.6 | 8 | 3 | 0 |

Kallis = Abdul Razzaq in One Day Cricket (Bowling)

So who is the better all-rounder, Kallis or Sobers?

ESPN's analysis of Kallis and Sobers
Is it Kallis or Garry Sobers? I won't wade into the religious debate of declaring one of them the best all-rounder in Test cricket history. Sobers and Kallis are both great all-rounders - prolific run scorers and genuine threats to opposing batsmen.

I've had the privilege of watching Kallis in my lifetime, and he is a great athlete - a great batsman and a great bowler rolled into one. His performance in both Test and One Day cricket has been stellar. He's clearly the best all-rounder in One Day cricket, and he gives Sobers tough competition in Tests.

Kallis is the best all-rounder in One Day cricket history

Kallis batting performance in One Day cricket

Hope South Africa and international cricket find someone of Kallis' stature.

Reference: espncricinfo.com (some images and player statistics)

Dec 31, 2012

Software engineer's guide to getting started with data science

Many of my software engineer friends ask me about learning data science. There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey, as a software engineer, learning Statistics and Data Visualization.

I'm midway through my 5-year journey to becoming proficient in data science. My learning program has included self-learning (books, blogs, toy problems), projects at work, classroom training (Stanford), teaching/presentations, and conferences (UseR, Strata). Here's what I've done so far, and what worked and what didn't...


a) Self-learning (2 - 4 months)

Explore if data science is for you

This is the key to getting started. Two years ago, some of us at work formed a study group to review the Stats 202 class material. This is what got me excited about data analytics and got me started. Only 2 of the 5 members of our study group chose to dive deeper into this field (data science is not for everyone).

b) Class-room training (9 - 12 months)

If you're serious about learning, enroll into a formal program

If you're serious about picking up this skill, opt for a course. The rigor of the class ensured that I didn't slack. Stanford offers great coursework to get started - far superior to many of the week-long training courses I've been to...

What worked for me


a) Spend 100% of my time on data science

  • Once I was hooked on data science, it was difficult to spend only 20% of my time on it to build expertise. I needed to spend 100% of my time on it, so I found work problems related to data science (big data analysis, healthcare, marketing & sales and retail analytics, optimization problems). 

b) Work on interesting problems

  • I aligned my learning goal with my passion. I found it energizing and engaging to solve interesting problems while learning new techniques. I was interested in retail, healthcare and sports (cricket) data analysis. 

c) Accelerate learning: 

d) Learn business domains

I'm lucky to have access to internal and external experts in data science, and they've helped me understand their approach to data science problems (how they think, hypothesize and test/assess/reject solutions). I've learned from them the importance of "Hypothesis-driven data analysis" rather than "blind/brute-force data analysis". This highlighted the importance of understanding the business domains really well before trying to extract meaningful insights from the data. This led me to understand operations research and marketing topics, retail, travel & logistics (revenue management) and healthcare industries. NY Times recently published an article highlighting the need for intuition.

What didn't work for me



  • Learning multiple statistical tools: A year ago, I started getting work requests for SAS programming, so I tried to learn it for a month or so, but couldn't. The main reasons were learning inertia and my love for the statistical tool I already knew - R. I really didn't need another statistical tool: I could solve most of my data science problems with R and the other software tools I knew. So my advice: if you already know SAS, Stata, Matlab, SPSS or Statistica well, stick with it. But if you're picking up a new statistical tool, pick R. R is open source, while most of the others are commercial software (expensive and complex).
  • Auditing courses: I tried to follow self-paced coursework from Coursera and other MOOCs, but it wasn't effective for me. I needed the routine and the pressure of a formal course with proper grading to go through the rigor.
  • Increasing academic workload: Manage your work-life balance and work commitments well. Earlier this year, I took multiple difficult courses at the same time and quickly realized that I wasn't enjoying them or learning as much as I should.
  • Sticking to the course textbook only: Many of the books in these classes are too "dense" for me (a software engineer), so I used other material to understand the concepts, e.g. regression notes from Carnegie Mellon.

Comments, questions, suggestions are welcome!

Nov 14, 2012

Big Data ETL and Big Data Analysis

I was at Strata New York 2012 last month. Great conference! Thanks O'Reilly media for assembling the industry leaders and running it well.

I understand it was too crowded for some of my out-of-town friends, and stepping out to the streets of midtown Manhattan for a breath of fresh air and calm wasn't an option either. Maybe O'Reilly can get a bigger space next year?

My primary interest in Big Data analysis is structured data analysis, i.e. crunching, munging (ETL) and analyzing large datasets in columns and rows.

My team regularly deals with 1-2 terabytes (~1 billion rows) of structured data (e.g. sales transaction data) for marketing/retail/healthcare analytics. Like others, we're spending a lot of time on Big Data ETL processes and less on Big Data analysis. Someone at Strata New York captured this well,
80% of a Big Data development effort goes into data integration efforts and only 20% of our effort/time is spent on analysis, i.e. interesting things we want to work on
I want to flip this equation. I want to be able to spend 20% of our time/effort on Big Data ETL/integration efforts and 80% on Big Data analysis.

At Strata, I wanted to see whether vendors and open source communities had simplified the Hadoop stack for my use. There have been many improvements in the past year, and my list of products to evaluate is quite long. We have more players in the Big Data space, and the solution space is muddy. It is great to have more vendors, experts and communities focusing on Big Data, but the product space is crowded, fragmented and CONFUSING (just listing all the Apache products discussed at Strata would need a full page).

I created a list of products to try out. I wish these products were easier to evaluate (legal paperwork, infrastructure footprint, and the ease of setting up and executing my use cases).

As far as ease of use and a powerful data science workbench are concerned, I want to use something like R or even Excel for both steps (Big Data ETL and Big Data analysis), but they are both memory constrained. So I need other options.

Did someone say Hadoop? Yup, on it. I tried it a few years ago, and we're exploring it now. Hadoop/MapReduce is THE infrastructure to power Big Data ETL and Big Data analysis.

I also believe that MapReduce and Databases are complementary technologies and the experts agree! See MapReduce and Parallel DBMSs: friends or foes?

Here's how I've framed my problem and thinking about the solution space.

Big Data ETL process

Take a big structured dataset (multiple CSV files totaling 100M-1B rows) and create DDL / clean / transform / split / sample / separate errors in minutes.

Solution options:
  1. Unix scripts (shell, awk, perl). Do all of the above in one pass (i.e. read each row only once) quickly. Start with Unix parallel processes, scale to multiple machines (mapreduce-style) only if needed
  2. Big Data ETL tools like Kafka?
  3. Open source ETL tools (e.g. Talend)
  4. Can commercial ETL tools do this in a few minutes/hours?
  5. Others?
Given any structured data from a client (CSV), our Big Data ETL workbench takes the data and processes it super fast: detect data types, clean, transform (e.g. change date formats to our internal standards), separate error rows, create a sample, and split into multiple clean files.

Raw data files have different schemas, which we auto-detect during processing (only string vs. numeric types to begin with).
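A minimal sketch of the one-pass idea from option 1 above: every row is read once, and type detection, error separation, a simple transform and sampling all happen in that single pass. The function name and toy rows are hypothetical; a real implementation would stream files from disk and parallelize per file:

```python
import random

def one_pass_etl(rows, sample_rate=0.01, seed=42):
    """Single pass over rows: detect column types (string vs numeric),
    separate error rows, apply a simple transform, and draw a random sample."""
    rng = random.Random(seed)
    clean, errors, sample = [], [], []
    numeric = None  # per-column numeric flag, refined as rows stream through
    for row in rows:
        if numeric is None:
            numeric = [True] * len(row)
        if len(row) != len(numeric):      # wrong column count -> error row
            errors.append(row)
            continue
        for i, field in enumerate(row):   # refine type detection
            if numeric[i]:
                try:
                    float(field)
                except ValueError:
                    numeric[i] = False
        clean.append([f.strip() for f in row])   # example transform
        if rng.random() < sample_rate:
            sample.append(row)
    types = ["numeric" if n else "string" for n in (numeric or [])]
    return clean, errors, sample, types

rows = [["2012-11-14", "store1", "19.99"],
        ["2012-11-15", "store2", "oops", "extra"],
        ["2012-11-16", "store3", "5.00"]]
clean, errors, sample, types = one_pass_etl(rows)
print(len(clean), len(errors), types)  # 2 1 ['string', 'string', 'numeric']
```

The same structure maps naturally onto awk or a MapReduce mapper, since each row is handled independently.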

Big Data Analysis

Then we load this data for analysis:
  1. RDBMS for well-defined arithmetic/set-based analysis
  2. noSQL database (Lucene/Solr with a Blacklight front-end for discovery). Blacklight is an open source discovery app built on Lucene/Solr; I'm thinking of it as a discovery app for Big Data analysis: facets on top of structured data, to slice-dice large structured datasets. We can add visualizations (e.g. summaries) later. Check out this Strata session on Lucene-powered Big Data analysis, which confirmed this design hypothesis
  3. The clean split files can be used in any stats tool as well for statistical analysis e.g. SAS for larger data sets, R for smaller ones (often the clean split files are small enough for R)
  4. New Big Data tools like Impala
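The facet idea behind option 2 is, at its core, grouped counts over fields of structured records; Lucene/Solr computes these at scale over an inverted index. A toy sketch (field names and data are made up):

```python
from collections import Counter

def facet_counts(records, field):
    """Count distinct values of one field -- the core of a facet widget."""
    return Counter(r[field] for r in records)

sales = [
    {"region": "NY", "category": "retail"},
    {"region": "NY", "category": "healthcare"},
    {"region": "SF", "category": "retail"},
]
print(facet_counts(sales, "region"))    # Counter({'NY': 2, 'SF': 1})
print(facet_counts(sales, "category"))  # Counter({'retail': 2, 'healthcare': 1})
```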

We're building proofs-of-concept for Big Data ETL (Unix scripts) and a Blacklight discovery app on top of Lucene/Solr. I will share them when they're ready. Stay tuned.

Mar 20, 2012

Data Analysis Training

I'm training some of my colleagues on Big'ish data analysis this week. Here's how I'm running the class. Would love your ideas to make it better.

Course objectives


After completion of the course, you will be able to:
  • Understand concepts of data science, related processes, tools, techniques and path to building expertise
  • Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
  • Use Excel to do basic analysis and plots
  • Write and understand R code (data structures, functions, packages, etc.)
  • Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to dataset)
  • Plot charts on a dataset using R

Prerequisites


  • Good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.)
  • Familiarity with Unix OS


A) Intro to data science
  • Explain data science and its importance. Data-driven business functions e.g. MROI, mix optimization, IPL teams / fantasy teams, predictions
  • Big data
    - Definition: data sets that no longer fit on a single disk, requiring compute clusters and the corresponding software and algorithms (e.g. MapReduce running on Hadoop)
    - Real big data problems: parallel computing, distributed computing, cloud, Hadoop, Cassandra
    - Most analysis isn't Big Data. Business apps often deal with datasets that fit in Excel/Access
  • Products: desktop tools (Excel (Solver, What-If), Access), SQL, SPSS, Stata, R, SAS, programming languages (Ruby, Python, Java) and their stats libraries, BI tools, etc.

B) Steps in data science
  1. Acquire data: "obtaining the data"... databases, log files... exports, surveys, web scraping etc.
  2. Verify data
  3. Cleanse and transform data: outliers, missing values, dedupe, merge
  4. Explore data: the first step with a new data set should be exploratory: what actually is in the data set? Summarize it and visually inspect the entire data
    - What does the data look like? summaries, cross-tabulation
    - What does knowing one thing tell me about another? Relationships between data elements
    - What the heck is going on?
  5. Visualize data
  6. Interact with data (not covered here): BI tools, custom dashboards, other tools (ggobi etc.)
  7. Archive data (not covered here)
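Steps 3-5 above can be sketched on a toy dataset with just the standard library (in class we'd do this in R; the names and data here are made up):

```python
from collections import Counter
from statistics import mean, stdev

# Toy dataset: (city, daily_sales) -- stands in for a freshly acquired CSV
data = [("NY", 120.0), ("NY", 95.0), ("SF", 80.0), ("SF", None), ("NY", 120.0)]

# 3. Cleanse: drop missing values, then dedupe while keeping order
clean = list(dict.fromkeys((c, s) for c, s in data if s is not None))

# 4. Explore: summaries and a simple cross-tabulation by city
sales = [s for _, s in clean]
summary = {"min": min(sales), "max": max(sales), "avg": mean(sales), "sd": stdev(sales)}
by_city = Counter(c for c, _ in clean)

print(summary)
print(by_city)   # Counter({'NY': 2, 'SF': 1})
```

Step 5 (visualize) would then plot `sales` and `by_city` with a charting tool.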

C) Skills needed for data science
  • Statistics: concepts, approach, techniques
  • Databases: SQL
  • Scripting languages: Ruby, Python
  • RegEx
  • Visual design: storytelling with charts
  • File handling: Unix preferred - awk, gzip, gunzip, paste, sort, etc.
  • Office tools: Excel (plugins like Solver, What-If)
  • Statistical tools: R, SAS, SPSS, Stata, MATLAB, etc.
  • BI tools: QlikView, Tableau

D) Learning R

We will pick a tool to learn the concepts of data science. We will use R, a leading open source stats package. Why I started learning data science and picked R

Curriculum for Intro to R (R has a steep learning curve; the purpose of this discussion is to get you started)

E) Where to go from here?
  • Learn advanced techniques (sampling, prediction) through books and conferences
  • Analyse your favorite dataset: e.g. Cricket data analysis
  • Compete (kaggle)
  • Learn other tools (Excel Solver, SAS etc.)



  • TBD

Jan 30, 2012

Explaining India's miserable Test cricket performance in 2011/2012

Today, when I reminded a friend not to lose hope in Indian cricket (after the recent whitewashes in England and Australia), another friend commented, 
Prasoon ji, this is Indian cricket... here, every victory becomes old news the very next day... You have to perform at your best... After all, they are getting unexpected money. They should deliver the goods as per citizens' expectations...
My reply to my friend was, 
Agreed, they need to perform... And here's my explanation of our recent performance... After the Aussie dominance ended a few years ago with Warne/Gilchrist/McGrath/others retiring, we've hit a period where most top Test teams are about equal (Eng/Aus/SA/SL/Ind)... As a result, the #1 Test team over the last 2-3 years has been the team that played the most games at HOME. And I expect this trend to continue. Here's some proof: India became #1 by playing all the strong teams at home in the 2008-10 period... England did the same in 2010-11 and became #1... now look at what Pakistan is doing to England outside England (England down 2-0)... The Aussies won 4-0 at home against us in 2011/12, but India beat Australia 2-0 not too long ago (at HOME)...

Now, don't misunderstand me... I'm not saying India's terrible performance is okay. It is NOT. Indian fans deserve to be pissed. The recent performance is terrible. Several innings defeats. No one firing. Giving up so easily. Really bad. But I'm highlighting the fact that the coming years will bring joy and heartbreak for fans depending on where their teams are playing (HOME games will bring JOY and AWAY games will BREAK THEIR HEARTS). Cricket fans, let us prepare ourselves for this. Our team's Test cricket performance will more or less depend on where they're playing (check out the ICC's Future Tours Program). Test cricket rankings won't mean much! Wasim Akram feels the same way.

Here are some charts that explain my point of view... 

1) See how India played more Tests at home between 2008 and 2010; in 2011/12, they've had mostly AWAY games. Similarly, in 2011 England played mostly at HOME

2) Notice how we see more RED (losses) than GREEN (wins) in the charts below: India's and England's performance in AWAY games

3) Notice how we see more GREEN (wins) than RED (losses) in the charts below: India's and England's performance in HOME games
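The pattern in these charts boils down to a simple tally of results by venue. A sketch on made-up results (the real numbers come from espncricinfo's records):

```python
from collections import defaultdict

def win_loss_by_venue(results):
    """Tally W/L/D per venue ('home'/'away') from (venue, result) pairs."""
    tally = defaultdict(lambda: {"W": 0, "L": 0, "D": 0})
    for venue, result in results:
        tally[venue][result] += 1
    return dict(tally)

# Made-up results illustrating the home-advantage pattern
results_2011_12 = [("away", "L"), ("away", "L"), ("away", "L"), ("away", "L"),
                   ("home", "W"), ("home", "W"), ("away", "D")]
print(win_loss_by_venue(results_2011_12))
```

Plotting these per-venue tallies over time produces exactly the red/green charts above.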