Sep 4, 2010

Cricket data analysis

Cricket World Cup 2011 is approaching, and I'm interested in analyzing One Day International cricket data to predict some results and share interesting information about cricket.


 
For the analysis, I needed cricket data and tried several things to get it:
  • Personal research: I explored the web but couldn't find aggregated cricket data anywhere. There are many cricket statistics websites, but none of them were useful (except Cricinfo, my favorite cricket website).
  • Reaching out to my network: I asked my friends for advice last month and received many emails with information and offers to help compile the data.
  • Reaching out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.
So I decided to collect this data myself by web scraping cricket scorecards. I first tried R libraries for web scraping but found them lacking. So I switched to Ruby, which has a great library for web scraping, Hpricot (thanks Satty for getting me started and Amit/Thomas for solving my newbie issues).
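
To give a feel for the scraping step, here is a minimal sketch. It is not the script described in this post: to stay free of third-party gems it uses Ruby's bundled REXML parser instead of Hpricot, and REXML only copes with well-formed markup. The table fragment, its id, and the scrape_rows helper are all hypothetical; real scorecard pages are larger and laid out differently.

```ruby
require "rexml/document"

# Hypothetical, simplified scorecard fragment (an assumption for illustration).
SAMPLE_HTML = <<~HTML
  <table id="batting">
    <tr><th>Player</th><th>Out</th><th>Runs</th><th>Balls</th></tr>
    <tr><td>V Sehwag</td><td>lbw b Kulasekara</td><td>12</td><td>12</td></tr>
    <tr><td>RG Sharma</td><td>lbw b Mathews</td><td>11</td><td>21</td></tr>
  </table>
HTML

# Turn each data row of the table into a hash keyed by the header cells.
def scrape_rows(html)
  doc = REXML::Document.new(html)
  headers = REXML::XPath.match(doc, "//th").map { |th| th.text.to_s.strip }
  doc.root.get_elements("tr")
     .map { |tr| tr.get_elements("td") }
     .reject(&:empty?) # skip the header row, which has no <td> cells
     .map { |cells| headers.zip(cells.map { |td| td.text.to_s.strip }).to_h }
end

rows = scrape_rows(SAMPLE_HTML)

# Print one tab-delimited line per batsman.
rows.each { |row| puts row.values.join("\t") }
```

All scraped values come back as strings, so numeric columns such as Runs and Balls would still need converting before analysis.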

I'm happy to report that now I have a robust Ruby script that can download all One Day International Cricket data (3000+ matches) in 3 handy files:

1) Win-Loss data:

Match_ID   Team1      Team2        Winner     Margin     First.Innings.Total  Second.Innings.Total  Ground    Matchdate     Ground_Country  Ground_Latitude  Ground_Longitude  Series
ODI no. 1  Sri Lanka  New Zealand  no result             203                                        Dambulla  Aug 19, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 2  Sri Lanka  India        Sri Lanka  8 wickets  103                  104                   Dambulla  Aug 22, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 3  India      New Zealand  India      105 runs   223                  118                   Dambulla  Aug 25, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series


2) Batting data:


Match_ID   Inning  Player        Country  Out                    Runs  Minutes  Balls  Fours  Sixes  Scorerate
ODI no. 1  1       V Sehwag      India    lbw b Kulasekara       12             12     2      0      100
ODI no. 1  1       RG Sharma     India    lbw b Mathews          11             21     2      0      52.38
ODI no. 1  1       Yuvraj Singh  India    lbw b Malinga          38             64     5      1      59.37
ODI no. 1  1       SK Raina      India    c Sangakkara b Perera  8              16     1      0      50
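
The Scorerate column can be derived from Runs and Balls as runs per 100 balls faced. The sketch below is an illustration, not code from the original script: the strike_rate helper is hypothetical, and truncating to two decimals (rather than rounding) is an assumption inferred from the sample rows, where 38 runs off 64 balls (59.375) is shown as 59.37.

```ruby
# Strike rate = runs scored per 100 balls faced.
# Truncation to two decimals is an assumption inferred from the sample data.
def strike_rate(runs, balls)
  return nil if balls.zero? # no balls faced: the rate is undefined
  # Rational arithmetic avoids floating-point surprises before truncating.
  Rational(runs * 100, balls).floor(2).to_f
end

puts strike_rate(11, 21) # RG Sharma
puts strike_rate(38, 64) # Yuvraj Singh
```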


3) Bowling data:

Match_ID   Inning  Player           Country    Overs  Maidens  Runs  Wickets  Economy
ODI no. 1  1       SL Malinga       Sri Lanka  9      1        21    2        2.33
ODI no. 1  1       KMDN Kulasekara  Sri Lanka  9      2        31    2        3.44
ODI no. 1  1       AD Mathews       Sri Lanka  8      3        20    1        2.5
ODI no. 1  1       NLTC Perera      Sri Lanka  7.4    1        28    5        3.65
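
The Economy column is runs conceded per over, but the Overs column uses cricket notation, where 7.4 means 7 overs and 4 balls rather than 7.4 overs. The sketch below shows one way to handle that. It is an illustration rather than the original script: both helper names are hypothetical, six-ball overs are assumed, and truncation to two decimals is again inferred from the sample rows.

```ruby
# Convert cricket overs notation ("7.4" = 7 overs and 4 balls) into balls.
# Assumes six-ball overs.
def overs_to_balls(overs)
  whole, extra = overs.to_s.split(".")
  whole.to_i * 6 + extra.to_i # extra is nil for whole overs; nil.to_i is 0
end

# Economy = runs conceded per six-ball over, truncated to two decimals
# (truncation is an assumption inferred from the sample data).
def economy(runs, overs)
  balls = overs_to_balls(overs)
  return nil if balls.zero? # did not bowl: economy is undefined
  Rational(runs * 6, balls).floor(2).to_f
end

puts economy(28, "7.4") # NLTC Perera
puts economy(21, "9")   # SL Malinga
```

Treating 7.4 as a plain decimal would give 28 / 7.4, or about 3.78, instead of the 3.65 shown in the table.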


This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune the script. Most of that time was spent dealing with typical data issues associated with web scraping and making the script generic enough to handle Test cricket and T20 cricket scorecards as well.

    8 comments:

    1. Finally, a link that looks relevant. I've been trying to find similar content, but to no avail. I did, however, get a lot of Test and first-class data, "the Classical Data", which hardly anyone in the world has and which is of great value. Help me with the T20I and ODI data I need and I'll give you all the data I have. Reply to me at vishall2402@gmail.com with the data you could provide, and I'll send some samples of the data I have.
    2. Hi, I am interested in the project too. Please email me at: mithaniDOTmuradATgmailDOTcom and we can discuss how to share data and proceed further.
    3. Interesting post, even more interesting blog! A resource I stumbled across that I suspect you may also find interesting (if you haven't already) is at:
      http://ai.arizona.edu/research/sports_data/index.asp
      http://ai.arizona.edu/mis480/
      Happy to share some experiences with scraping etc (not using Ruby). Drop me an email at scottATmambooksDOTcom if you want to discuss.
    4. Thanks for the interesting links SJ. KM in Cricket is the kind of insights I'm interested in too... Will shoot you an email...
    5. Hi,
      I am Rajendran, and I teach at Zoho University (http://www.zohocorp.com and http://www.facebook.com/ZohoUniversity).
      While looking at ways to make SQL Querying interesting to my students, I suddenly wondered if it would be a good idea to teach them something they should know through something that they'd LOVE to know.
      Googled for cricket-ODI-data, and that's how I got here...

      What should I do to get my hands on your data gold mine?

      Rajendran.
      rajendran@zohocorp.com
    6. It's nice to see that people have already worked on the topics we are interested in. I am a research student and want to do some analysis on cricket data. I would be greatly obliged if you could provide the scraped data. Thanks
      Adil
      adilmukarram@gmail.com
    7. Hi, I'm Omkar, doing a master's degree. I had this idea of mining cricket data. But as you have already posted, Google and other sites don't come to your rescue. I appreciate that you have taken the trouble to search for the data and also made your experience available for people like us. I tried collecting data manually, but it's taking a long time. I would be grateful if I could get a chance to discuss this research and data collection. My email ID is omkar.ghodge@gmail.com
    8. Hey, thanks for responding. I'm very much interested in knowing what kind of analysis you have performed or are performing on this data. Are you involved in data mining?