Many of my software engineer friends ask me about learning data science. There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my own journey, as a software engineer, of learning statistics and data visualization.

### 1. GETTING STARTED

### a) Self-learning (2 - 4 months)

*Explore if data science is for you.* This is the key to getting started. Two years ago some of us at work formed a study group to review Stats 202 class material. This is what got me excited about data analytics and got me started. Only 2 of the 5 members of our study group chose to dive deeper into this field (data science is not for everyone).

- **Learn basic statistics**: Stats 202 coursework is perfect for this.
- **Learn a statistical tool**: I spent 3 months heads-down learning R as a newbie and had the most fun doing so. Why learn R?
- **Solve toy problems**: Curiosity is key to data science. If you have questions about your country's economy, crime stats, or sports performance, get the data and start answering them.
- **Learn Unix tools**: I picked O'Reilly's book Data Analysis with Open Source Tools (a hands-on guide for programmers and data scientists) to read.
- **Learn SQL and scripting languages**: I know Java, Ruby and SQL. Python is on my list.

There's a lot of training material available online:

- Stats 202
- Caltech Data Science course
- Coursera: Introduction to Data Science, Machine learning, Data Analysis, Computing for Data Analysis
- University of California Berkeley - Introduction to Data Science
- Knight Center for Journalism's course on Introduction to Infographics and Data Visualization
- Stats 101: Udacity (Intro to Stats), Khan Academy, Carnegie Mellon's stats course
- Learn R

### b) Class-room training (9 - 12 months)

*If you're serious about learning, enroll in a formal program.* The rigor of a class ensured that I didn't slack. Stanford offers great coursework to get started. Its courses are far superior to the many week-long training courses I've attended:

- Data Mining and Analysis STATS202
- Linear and Nonlinear Optimization MS&E211
- Mining Massive Data Sets CS246
- Modern Applied Statistics: Learning STATS315A
- Statistical Methods in Finance STATS240P
- Modern Applied Statistics: Data Mining STATS315B

### 2. GETTING FOCUSED

### a) Spend 100% of my time on data science

- Once I was hooked on data science, spending only 20% of my time on it wasn't enough to build expertise. I needed to spend 100% of my time on it, so I found work problems related to data science (big data analysis, healthcare, marketing & sales and retail analytics, optimization problems).

### b) Work on interesting problems

- I aligned my learning goal with my passion. I found it energizing and engaging to solve interesting problems while learning new techniques. I was interested in retail, healthcare and sports (cricket) data analysis.

### c) Accelerate learning

- **Teach**: I taught introductory R and data mining classes to colleagues and friends. This helped me reinforce my learning and get others excited about the topic. It is also a great way for me to give back to the open source community. Blogging is another medium to contribute and learn.
- **Follow the leaders in data science and network with data scientists**: DJ Patil, Hilary Mason, Jeff Hammerbacher, Carla Gentry, Monica Rogati, Cathy O'Neil. These are the people I look up to. There are many others in this space; apologies to those I've missed.
- **Follow interesting blogs**: http://datascience101.wordpress.com, http://columbiadatascience.com/blog, http://www.r-bloggers.com, http://www.datawrangling.com, http://flowingdata.com (Quora's best data blog list)
- **Attend conferences/meetups periodically**: Local data science/R meetups. O'Reilly Strata is great! Given how rapidly this field is evolving, I go there at least every other year. UseR is wonderful for seeing what's happening in the world of R.
- **Learn Big Data techniques**: MapReduce/Hadoop, cloud computing. I avoided picking any commercial vendor technology, and in retrospect it was a good decision.

### d) Learn business domains

I'm lucky to have access to internal and external experts in data science, and they've helped me understand their approach to data science problems (how they think, hypothesize and test/assess/reject solutions). I've learned from them the importance of "Hypothesis-driven data analysis" rather than "blind/brute-force data analysis". This highlighted the importance of understanding the business domains really well before trying to extract meaningful insights from the data. This led me to understand operations research and marketing topics, retail, travel & logistics (revenue management) and healthcare industries. NY Times recently published an article highlighting the need for intuition.

### 3. DATA SCIENCE BOOKS I FOUND USEFUL

- Introduction to Data Mining by Tan, Steinbach and Kumar. This is the textbook used in many introductory data science courses, including Stats 202 at Stanford. A great guide to keep handy.
- R in a Nutshell
- Data Analysis with Open Source Tools
- Beautiful Visualization
- See more books on data science: O'Reilly, Manning

### 4. WHAT DIDN'T WORK FOR ME

**Learning multiple statistical tools:** A year ago, I started getting some work requests for SAS programming, so I wanted to learn it. I tried for a month or so but could not do it. The main reasons were learning inertia and my love for the statistical tool I already knew: R. I really didn't need another statistical tool. I could solve most of my data science problems with R and the other software tools I knew. So my advice is that if you already know SAS, Stata, Matlab, SPSS or Statistica very well, stick with it. However, if you're learning a new statistical tool, pick R. R is open source, while most of the others are commercial software (expensive and complex).

**Auditing courses:** I tried to follow self-paced coursework from Coursera and other MOOCs, but it wasn't effective for me. I needed the routine and pressure of a formal course with proper grading to go through the rigor.

**Increasing academic workload:** Manage work-life balance and work commitments well. Earlier this year, I tried to take multiple difficult courses at the same time and quickly realized that I wasn't enjoying them or learning as much as I should.

**Sticking to the course textbook only:** Many of the books in these classes are too "dense" for me (a software engineer), so I used other material to understand the concepts, e.g. regression from Carnegie Mellon notes.

Comments, questions, suggestions are welcome!

VERY HELPFUL. THANKS. HOW ABOUT COLUMBIA UNIVERSITY'S COURSE?

Columbia has a great program on data science. See the list of their coursework in the Stats dept here -> http://statistics.columbia.edu/content/course-descriptions. Check out the notes from their Intro to Data Science course here -> http://columbiadatascience.com.

In fact, there are many US universities offering data science programs now.

Instead, I need a "Statistician's guide to getting started with software engineering".

I'm in the same boat. I have the business acumen and stats background; I just need to get more comfortable with SQL and Python to round out my programming skills.

Fortunately, I think that most of these recommendations apply, with a heavier emphasis on the programming languages up-front.

I'm hoping someone with a Stats background can list their experience of picking up software engineering skills to serve as a "Statistician's guide to getting started with software engineering." Here's what I've taught my colleagues in the past for data science training. It might be helpful to review it.

I'm teaching another batch in January 2013 in Mumbai and will post the curriculum and my notes soon.

Awaiting your post.

Hi Kaizer, I will post the training agenda soon. The trip to Mumbai was productive and hectic. We identified 10 engineers from the batch of 20 to work with our team on data analysis.

As I reflect on some books I've read, I realize that some stats books were too "dense" for me, a software engineer. The trick that worked for me was to look for alternative texts. The book or text that I liked was often not the one suggested by the course professor. Often the course material and lecture notes from Carnegie Mellon, MIT and Columbia courses are helpful.

How could we access a Stanford class without a SUNet ID? Thanks

Thanks for all this great information. It is helping me develop my plan. I especially like the part about you finding problems at work that let you use your DS skills... I am doing that too :)

Great helpful information. Thanks for sharing.

@sas study, I don't think it's possible to see Stanford class material without a SUNet ID. However, many of the lecture notes and tests are available on the (public) course websites.

Thanks @Lillian and @Gaurav. Here's a vendor-offered course that I found helpful in the last few months: Cloudera's Hadoop developer and Pig/Hive training.

Good structured information. The sequence is truly helpful. Thanks for the post. Stanford is offering its STATS202 course as part of a resident summer session program (June-July-August). Now I am stuck between taking this class at Stanford Summer Session and taking their online Data Mining & Applications certificate program. Can you provide some input on this? Thanks.

@Pronojit, there's nothing like taking Stats 202 on campus. Go for it! The interaction in the class and the team projects will make it much more enriching. Cheers!

Thanks for that input. Apart from STATS 202, what would be some good sources to dive deeper into data science in the San Francisco area?

Great information that has helped me plan my training program. When are you teaching next?

Thank you for all the information, very helpful. I was looking at this program at Stanford: the Mining Massive Data Sets Graduate Certificate. Do you know if they offer any placement assistance after completing the program?

Stanford has multiple, effective ways to assist with placement. It's best to call the graduate office to get this information.

Thank you, sir. I have just passed 12th grade. Can you please tell me whether a B.Sc. in Statistics will help in becoming a data scientist?

I have no experience with B.Sc. Stats coursework, so it is very difficult to comment. It is perhaps best to tailor your coursework to include the key skills needed for data science: stats + computer science.