Sep 4, 2010

Cricket data analysis

The 2011 Cricket World Cup is approaching, and I want to analyze One Day International (ODI) cricket data to predict some results and share interesting findings about the game.


 
For the analysis I need cricket data, and I tried several ways to get it:
  • Personal research: I explored the web but couldn't find aggregated cricket data anywhere. There are many cricket-statistics websites, but none of them were useful (except Cricinfo, my favorite cricket website).
  • Reaching out to my network: Last month I asked my friends for advice and received many emails with information and offers to help compile the data.
  • Reaching out to sports data companies: I contacted Opta Sports to buy this data. Although they have the data, it was too expensive for my personal experiment.
So I decided to collect the data myself by scraping cricket scorecards from the web. I first tried R libraries for web scraping but found them lacking, so I switched to Ruby, which has a great library for web scraping, Hpricot (thanks to Satty for getting me started and to Amit and Thomas for solving my newbie issues).
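
As a rough illustration of the scraping step, the sketch below pulls the cell text out of each table row of a scorecard page. It is not the script described in this post: the HTML fragment and the `extract_rows` helper are made up for the example, and it uses plain regular expressions only so that it runs without installing any gems. A real scraper would hand the page to an HTML parser such as Hpricot instead.

```ruby
# Hypothetical scorecard fragment; real scorecard markup will differ.
SAMPLE_HTML = <<~HTML
  <table>
    <tr><th>Player</th><th>Out</th><th>Runs</th><th>Balls</th></tr>
    <tr><td><a href="#">V Sehwag</a></td><td>lbw b Kulasekara</td><td>12</td><td>12</td></tr>
    <tr><td><a href="#">RG Sharma</a></td><td>lbw b Mathews</td><td>11</td><td>21</td></tr>
  </table>
HTML

# Returns one array of cell strings per <tr> that contains <td> cells.
def extract_rows(html)
  rows = html.scan(%r{<tr[^>]*>(.*?)</tr>}m).map do |(row_html)|
    row_html.scan(%r{<td[^>]*>(.*?)</td>}m).map do |(cell_html)|
      cell_html.gsub(/<[^>]+>/, "").strip # drop nested tags such as links
    end
  end
  rows.reject(&:empty?) # header rows made of <th> cells yield no <td> matches
end

extract_rows(SAMPLE_HTML).each { |cells| puts cells.join("\t") }
```

Regular expressions break easily on real-world HTML (nested tables, unclosed tags), which is one reason a dedicated parser is the better choice for a full scraper.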

I'm happy to report that I now have a robust Ruby script that can download all One Day International cricket data (3000+ matches) into 3 handy files:

1) Win-Loss data:

Match_ID   Team1      Team2        Winner     Margin     First.Innings.Total  Second.Innings.Total  Ground    Matchdate     Ground_Country  Ground_Latitude  Ground_Longitude  Series
ODI no. 1  Sri Lanka  New Zealand  no result             203                                        Dambulla  Aug 19, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 2  Sri Lanka  India        Sri Lanka  8 wickets  103                  104                   Dambulla  Aug 22, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
ODI no. 3  India      New Zealand  India      105 runs   223                  118                   Dambulla  Aug 25, 2010  Sri Lanka       7.8566667        80.6491667        Sri Lanka Triangular Series
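
The Margin column mixes two kinds of result ("8 wickets", "105 runs"), so a small parser is handy before any analysis. The sketch below is only an illustration under stated assumptions: `parse_margin` and `run_margin_consistent?` are hypothetical helper names, and the check assumes that a win by runs equals the first innings total minus the second, which holds for the sample row above (223 - 118 = 105).

```ruby
# Parse a margin such as "105 runs" or "8 wickets" into [number, unit].
# Returns nil when there is no margin (for example a "no result" match).
def parse_margin(text)
  match = text.to_s.strip.match(/\A(\d+)\s+(runs?|wickets?)\z/)
  return nil unless match

  [match[1].to_i, match[2].sub(/s\z/, "").to_sym]
end

# A win by runs should equal the gap between the two innings totals.
# Wicket wins and missing margins are not checked and count as consistent.
def run_margin_consistent?(margin_text, first_total, second_total)
  value, unit = parse_margin(margin_text)
  return true unless unit == :run

  first_total - second_total == value
end

p parse_margin("8 wickets")
p run_margin_consistent?("105 runs", 223, 118)
```

A check like this is a cheap way to flag scorecards that were scraped incorrectly.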


2) Batting data:


Match_ID   Inning  Player        Country  Out                    Runs  Minutes  Balls  Fours  Sixes  Scorerate
ODI no. 1  1       V Sehwag      India    lbw b Kulasekara       12             12     2      0      100
ODI no. 1  1       RG Sharma     India    lbw b Mathews          11             21     2      0      52.38
ODI no. 1  1       Yuvraj Singh  India    lbw b Malinga          38             64     5      1      59.37
ODI no. 1  1       SK Raina      India    c Sangakkara b Perera  8              16     1      0      50
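
The Scorerate column can be recomputed from Runs and Balls as runs per 100 balls, which gives a quick sanity check on scraped rows. This is a minimal sketch, not code from the script: `score_rate` is a hypothetical helper, and truncating to two decimals (rather than rounding) is an inference from the sample, where 38 runs off 64 balls appears as 59.37 and not 59.38.

```ruby
# Runs per 100 balls, truncated (not rounded) to two decimals.
# Returns nil when no balls were faced, since the rate is undefined.
def score_rate(runs, balls)
  return nil if balls.zero?

  (runs * 100.0 / balls).floor(2)
end

puts score_rate(11, 21) # RG Sharma in the sample above
puts score_rate(38, 64) # Yuvraj Singh in the sample above
```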


3) Bowling data:

Match_ID   Inning  Player           Country    Overs  Maidens  Runs  Wickets  Economy
ODI no. 1  1       SL Malinga       Sri Lanka  9      1        21    2        2.33
ODI no. 1  1       KMDN Kulasekara  Sri Lanka  9      2        31    2        3.44
ODI no. 1  1       AD Mathews       Sri Lanka  8      3        20    1        2.5
ODI no. 1  1       NLTC Perera      Sri Lanka  7.4    1        28    5        3.65
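
One thing to watch in the bowling data is that Overs is not a decimal number: "7.4" means 7 overs and 4 balls, so Economy has to be computed from balls bowled. The sketch below shows one way to do that. The helper names `overs_to_balls` and `economy` are hypothetical, and six-ball overs are assumed, which would not hold for any matches played with eight-ball overs.

```ruby
BALLS_PER_OVER = 6 # assumption: six-ball overs

# Convert cricket overs notation to balls: "7.4" is 7 overs and 4 balls.
# The value is read as text so that 7.4 is never treated as a decimal.
def overs_to_balls(overs)
  whole, part = overs.to_s.split(".")
  whole.to_i * BALLS_PER_OVER + part.to_i
end

# Runs conceded per over, rounded to two decimals.
# Returns nil when no balls were bowled.
def economy(runs, overs)
  balls = overs_to_balls(overs)
  return nil if balls.zero?

  (runs * BALLS_PER_OVER.to_f / balls).round(2)
end

puts overs_to_balls("7.4")
puts economy(28, "7.4") # NLTC Perera in the sample above
```

Dividing 28 by 7.4 directly would give 3.78 instead of the 3.65 shown in the sample, which is why the conversion to balls matters.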


This Ruby script takes about 40 minutes on a fast internet connection to collect the data. It took me about 40 hours to write and fine-tune the script. Most of that time went into dealing with the typical data issues that come with web scraping and into making the script generic enough to handle Test cricket and T20 cricket scorecards as well.

    22 comments:

    1. Finally, a link that looks relevant. I've been trying to find similar content, but with no luck. I did get a lot of Test and First Class data, "the Classical Data", which very few people globally have and which is of great value. Help me with the data I need about T20I and ODI and I'll give you all the data that I have. Reply to me at vishall2402@gmail.com about the data you could provide, and I can send some samples of the data I have.

    2. Hi, I am interested in the project too. Please email me at: mithaniDOTmuradATgmailDOTcom and we can discuss how to share data and proceed further.

    3. Interesting post, even more interesting blog! A resource I stumbled across that I suspect you may also find interesting (if you haven't already) is at:
      http://ai.arizona.edu/research/sports_data/index.asp
      http://ai.arizona.edu/mis480/
      Happy to share some experiences with scraping etc (not using Ruby). Drop me an email at scottATmambooksDOTcom if you want to discuss.

    4. Thanks for the interesting links SJ. KM in Cricket is the kind of insights I'm interested in too... Will shoot you an email...

    5. Hi,
      I am Rajendran, and I teach at Zoho University (http://www.zohocorp.com and http://www.facebook.com/ZohoUniversity).
      While looking at ways to make SQL Querying interesting to my students, I suddenly wondered if it would be a good idea to teach them something they should know through something that they'd LOVE to know.
      Googled for cricket-ODI-data, and that's how I got here...

      What should I do to get my hands on your data gold mine?

      Rajendran.
      rajendran@zohocorp.com

    6. Anonymous, May 31, 2011

      It's nice to see that people have already worked on the topics we are interested in. I am a research student and want to do some analysis on cricket data. I would be greatly obliged if you could provide the scraped data. Thanks
      Adil
      adilmukarram@gmail.com

    7. Hi, I'm Omkar, doing a master's degree. I had this idea of mining cricket data, but as you have already posted, Google and other sites don't come to the rescue. I appreciate that you took the trouble to search for the data and shared your experience with people like us. I tried collecting data manually but it's taking a long time. I would be grateful if I could get a chance to discuss this research and data collection. My email id: omkar.ghodge@gmail.com

    8. Hey thanks for responding.. I m very much interested in knowing what kind of analysis u have performed or been performing on this data? U involved in data mining?

    9. Hi.. im working on a university project and I would be in need of the data. Is it possible if you can share the data (Win-Loss data) ? . its for non profit.
      my mail id: puneeth.pun10@gmail.com

      thanks in advance..

    10. I would really like to get hold of this data... Please could you contact me in order to let me know what you would like in exchange for the data? Also, could you let me know as to how much cricket data you have?
      my email address: antongschaffer@gmail.com

    11. Hi Prasoon, Appreciate your efforts!
      I am very much interested in such data.
      can you please get back with details on abhilesh.c@gmail.com.

    12. can u pls mail me the data at prahalad300791@gmail.com. It is for a college project and will help so much. Thanks a lot

    13. Can you please share the script/data that you have used? (naidu.vaddadi@gmail.com)

    14. can you please share the Ruby script with me at suvranil1412@gmail.com. I would be really grateful. Thanks in advance.

    15. hi I am first year Master's student looking for Cricket Stat data for my ML project.
      Could you plz share data or script with me.
      Email - aniket0123@gmail.com

    16. Can you please email me the data, I am starting to learn R and doing data analysis. This data will be super useful, Please email me at tak2prasanna@gmail.com

    17. Hi! I'm really keen to work on a similar kind of a project.

      Any help from you would be great. I'd like to just ask a few things. Any method of contacting you would do.
      E-mail: shreyakhurana11235@gmail.com

    18. Hi - looks like you've done a fab job! After a lot of searches on the net I arrived here. I am learning data mining techniques and would like to play with the cricket big data. How can I get access to your dataset? It would be really helpful if I could get access. BTW, what's the prediction for WC2015?

      My id - agopals@yahoo.com

      Regards.

    19. Hi - It's a great job, wonderful work. For the past few months I have been learning the Hadoop ecosystem, but so far I haven't been able to get my hands on something interesting. Cricket is always interesting, and you have already done a lot of good work in this Big Data field. Can you please tell me how I can get the dataset or the Ruby script to create that dataset? It would be really helpful for me. Thanks in advance :)

      Email : mgyaar@gmail.com

      Warm Regards,
      Mohit Garg

    20. Hi..

      It would be helpful if you could send me the datasets. i am interested in analyzing the data using the visualization tool. Thanks in advance :)
      My mail id is anushriagarwal87@gmail.com

    21. Hi..

      It would be helpful if you could send me the datasets. i am interested in analyzing the data using the visualization tool. Thanks in advance :)
      My mail id is anushriagarwal87@gmail.com

    22. Could you please share the dataset and the script, this is for a student project work. link.shyam@gmail.com Thanks
