I'm training some of my colleagues on Big'ish data analysis this week. Here's how I'm running the class. Would love your ideas to make it better.


Curriculum for Intro to R (R has a steep learning curve. The purpose of this discussion is to get you started.)

**CLASS OBJECTIVES (LEARNING OUTCOMES)**

After completion of the course, you will be able to:

- Understand concepts of data science, related processes, tools, techniques and path to building expertise
- Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
- Use Excel to do basic analysis and plots
- Write and understand R code (data structures, functions, packages, etc.)
- Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to dataset)
- Plot charts on a dataset using R

**CLASS PREREQUISITES**

- Good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.)
- Familiarity with Unix OS

**CLASS TOPICS**

**A) Intro to data science**

- Explain data science and its importance. Data-driven business functions, e.g. MROI, mix optimization, IPL teams / fantasy teams, predictions
- Big data
    - Definition: Data sets that no longer fit on a disk, requiring compute clusters and the corresponding software and algorithms (e.g. map/reduce running on Hadoop).
    - Real big data problems: parallel computing, distributed computing, cloud, Hadoop, Cassandra
    - Most analysis isn't Big Data. Business apps often deal with datasets that fit in Excel/Access
- Products: Desktop tools such as Excel (Solver, What If), Access, SQL, SPSS, Stata, R, SAS; programming languages (Ruby, Python, Java) and the stats libs in these languages; BI tools, etc.

**B) Steps in data science**

- Acquire data: "obtaining the data" from databases, log files, exports, surveys, web scraping, etc.
- Verify data
- Cleanse and transform data: outliers, missing values, dedupe, merge
- Explore data: The first step when dealing with a new data set needs to be exploratory in nature: what actually is in the data set? Summarize and visually inspect the entire data set
    - What does the data look like? Summaries, cross-tabulation
    - What does knowing one thing tell me about another? Relationships between data elements
    - What the heck is going on?
- Visualize data
- Interact with data (not covered here): BI tools, custom dashboards, other tools (ggobi etc.)
- Archive data (not covered here)
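As a rough illustration of the verify, cleanse, and explore steps above, here is a minimal sketch using the Unix tools covered in class. The file `scores.csv`, its columns, and every value in it are invented for this example; a real dataset would need its own rules for what counts as a duplicate or a missing value.

```shell
# Sketch only: the file name, columns, and values below are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/scores.csv" <<'EOF'
team,player,runs
A,p1,30
A,p2,45
B,p3,
B,p4,60
A,p1,30
B,p5,15
EOF

# Verify: count the data rows (everything after the header line).
raw_rows=$(tail -n +2 "$workdir/scores.csv" | wc -l | tr -d ' ')

# Cleanse: drop exact duplicate rows, then rows with a missing runs value.
tail -n +2 "$workdir/scores.csv" | sort -u | awk -F, '$3 != ""' > "$workdir/clean.csv"
clean_rows=$(wc -l < "$workdir/clean.csv" | tr -d ' ')

# Explore: count, min, max, and mean of runs per team.
summary=$(awk -F, '
  {
    v = $3 + 0                      # force a numeric comparison
    n[$1]++; sum[$1] += v
    if (n[$1] == 1 || v < lo[$1]) lo[$1] = v
    if (n[$1] == 1 || v > hi[$1]) hi[$1] = v
  }
  END {
    for (t in n)
      printf "%s n=%d min=%d max=%d mean=%.1f\n", t, n[t], lo[t], hi[t], sum[t] / n[t]
  }' "$workdir/clean.csv" | sort)

echo "rows: $raw_rows raw, $clean_rows after cleansing"
echo "$summary"
rm -rf "$workdir"
```

The same questions (how many rows, what is missing, what does each group look like) are the ones the R sessions answer with data frames; the shell version is handy when a file is too awkward to open in Excel.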

**C) Skills needed for data science**

- Statistics: Concepts, approach, techniques
- Databasing: SQL
- Scripting language: Ruby, Python
- RegEx
- Visual design: Story telling with charts
- File handling: Unix preferred. awk, gzip, gunzip, paste, sort etc.
- Office tools: Excel (plugins like Solver, What If)
- Statistical tools: R, SAS, SPSS, Stata, MATLAB, etc.
- BI tools: Qlikview, Tableau
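For the file handling skill in the list above, a small sketch of how gzip, gunzip, sort, join, awk, and paste fit together might look like this. The two input files, their ids, names, and scores are all made up for the example.

```shell
# Sketch only: file names, ids, names, and scores are invented for illustration.
workdir=$(mktemp -d)
printf '2 bob\n1 ann\n3 cy\n' > "$workdir/players.txt"
printf '3 70\n1 95\n2 40\n' > "$workdir/runs.txt"

# gzip replaces players.txt with players.txt.gz; gunzip -c streams it back to stdout.
gzip "$workdir/players.txt"

# join expects both inputs sorted on the join field (here, column 1).
gunzip -c "$workdir/players.txt.gz" | sort -k1,1 > "$workdir/players.sorted"
sort -k1,1 "$workdir/runs.txt" > "$workdir/runs.sorted"
joined=$(join "$workdir/players.sorted" "$workdir/runs.sorted")

# Sort numerically (descending) on the third column to find the top scorer.
top=$(echo "$joined" | sort -k3,3nr | head -n 1)

# awk picks out columns; paste glues two files together line by line.
awk '{ print $2 }' "$workdir/players.sorted" > "$workdir/names.txt"
awk '{ print $2 }' "$workdir/runs.sorted" > "$workdir/scores.txt"
pasted=$(paste -d, "$workdir/names.txt" "$workdir/scores.txt")

echo "$joined"
echo "top scorer: $top"
echo "$pasted"
rm -rf "$workdir"
```

Note that paste matches lines purely by position, so it only gives a sensible result here because both files were sorted on the same id first; join is the safer choice when the keys might not line up.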

**D) Learning R**

We will pick a tool to learn the concepts of data science. We will use R, a leading open source stats package.

- Why I started learning data science and picked R

**E) Where to go from here?**

- Learn advanced techniques: sampling, predictions. Books, conferences
- Analyse your favorite dataset, e.g. cricket data analysis
- Compete (Kaggle)
- Learn other tools (Excel Solver, SAS, etc.)

**REFERENCE**

**Tutorials**

- Stats202 class
- UCLA's mini course on R
- R intro
- R fundamentals
- R data import/export
- R-bloggers
- Web app integration
- RTips
- TBD

**Books**

- TBD

This looks great...wish I could take the class!

Erin

Very good summary,

One minor comment: after B.4) do you plan to give an intro to basic modeling techniques (like regression etc. using R)?

Thanks Erin and Daniel. I decided to keep clustering, decision trees and regressions for the next class... Stay tuned...

Question for you - what BI tools will you cover?

@mike, we discussed mid-tier BI tools like Qlikview and Tableau in passing (no details, just introduced the area). Briefly discussed enterprise-grade tools like Cognos and Business Objects too...

Hi prasoonsharma, do you plan to take on students online? I'd love to join your class and learn more about R. Would you? (:

Hi Prasoon,

I do not have the privilege to attend your class...just that I have a fairly good system and my father is a mathematician...request you to tell me how can I learn big data...I want to be an expert in two years.

Thanks,

Sujeet

Hi Sujeet, perhaps the best way to start is to sign up for the Stats 202 class. Stanford teaches Stats using the Intro to Data Mining book and R software. Check it out: http://stats202.com

Buy the book on Amazon: http://www.amazon.com/Introduction-Data-Mining-Pang-Ning-Tan/dp/0321321367

Read more about the book on Kumar's page: http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Nice Info.

The number of points and ideas given here is quite considerable and up to the level of my expectations. Hopefully this could prove to be a great guide if further information can be taken from the same platform.
