
Data Science in Action
Peerapon Vateekul, Ph.D.
Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University

Chula Data Science

Outline
• Data Science & Data Scientist
• Data Mining
• Analytics with R
• A Framework for Big Data Analytics

Data Science & Data Scientist

What is Data Science?
• Data: facts and statistics collected together for reference or analysis
• Science: a systematic study through observation and experiment
• Data Science: the scientific exploration of data to extract meaning or insight, and the construction of software to utilize such insight in a business context
(Figure: Data Preparation, Data Analysis, Data Visualization, Data Product)

What is Data Science? (cont.)
• Transform data into valuable insights
  • Example: Social Influence in Social Advertising: Evidence from Field Experiments (Bakshy et al. 2012)
• Transform data into data products
  • Service recommendation
  • Fraud detection
  • Email classification (spam detection)
• Transform data into interesting stories

Data Science: Famous Definition

Data Science: Components
• Domain Expertise
• Statistics
• Visualization
• Data Engineering
• Advanced Computing

Data Science Process: Iterative Activity

Data Science Tasks

Data Science with Big Data
• Very large raw data sets are now available:
  • Log files
  • Sensor data
  • Sentiment information
• With more raw data, we can often build models with better predictive performance.
• To handle the larger datasets, we need a scalable processing platform such as Hadoop and YARN.

Who builds these systems?

Data Scientist:

By Thomas H. Davenport and D.J. Patil From the October 2012 issue

It is estimated that by 2018, the US could have a shortage of 140,000+ people with advanced analytical skills!

Definition
Roles: Business Person, Mathematician, Computer Scientist
Skills:
• Data collection systems
• Statistical models
• Domain expertise
• Machine learning algorithms
• Evaluation metrics
• Knowing what questions to ask
• Predictive analytics
• Interpreting results for business decisions
• Presenting outcomes
• Interface design
• Design/manage/query database
• Data aggregation
• Data mining
• Data visualization

Needed Skills
• Applied Science
  • Statistics, applied math
  • Machine Learning, Data Mining
  • Tools: Python, R, SAS, SPSS
• Business Analysis
  • Data Analysis, BI
  • Business/domain expertise
  • Tools: SQL, Excel, EDW
• Data Engineering
  • Database technologies
  • Computer science
  • Tools: Java, Scala, Python, C++
• Big Data Engineering
  • Big data technologies
  • Statistics and machine learning over large datasets
  • Tools: Hadoop, Pig, Hive, Cascading, Solr, etc.

The Data Science Team

Data Mining

What is Data Mining (DM)?
• An automatic process of discovering useful information in large data repositories with sophisticated algorithms
• Data mining draws on: Statistics, Machine Learning, Database systems

Data Mining Tasks
• Predictive Task (Supervised Learning)
  • Classification
  • Regression
• Descriptive Task (Unsupervised Learning)
  • Clustering
  • Association Rules Mining
  • Sequence Analysis
• Other
  • Collaborative filtering (recommendation engine): uses techniques from both the supervised and unsupervised worlds.

Supervised Learning: learning from target

Training dataset:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1
Target (one per training row): 0 1 0 1 1 0 1 0 0 1 0

Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Target: ?

Classification: predicting a category
• Example: predict targeted customers who tend to buy our product (yes/no)
• (Figure axes: Age, Salary)
Some techniques:
• Naïve Bayes
• Decision Tree
• Logistic Regression
• Support Vector Machines
• Neural Network
• Ensembles
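As a rough illustration of the classification idea above, here is a minimal sketch in plain Python (not one of the R packages listed later in the deck). It fits a one-level decision tree (a "decision stump"), the simplest member of the decision tree family. The customer records, the (age, salary) feature layout, and the function names `fit_stump` and `predict` are all made up for this example, and it assumes at least one feature has two distinct values.

```python
def fit_stump(rows, labels):
    """Pick the single feature/threshold split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            # Try both label orientations for the two sides of the split.
            for below, above in ((0, 1), (1, 0)):
                errors = sum(
                    (below if r[f] <= thr else above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, thr, below, above)
    _, f, thr, below, above = best
    return f, thr, below, above


def predict(stump, row):
    """Classify one row with a fitted stump."""
    f, thr, below, above = stump
    return below if row[f] <= thr else above


# Hypothetical customers: (age, salary in thousands); label 1 = bought the product.
customers = [(22, 18), (25, 32), (31, 24), (38, 41), (45, 60), (52, 75)]
bought = [0, 0, 0, 1, 1, 1]

stump = fit_stump(customers, bought)
print(stump)
print(predict(stump, (29, 30)), predict(stump, (48, 55)))
```

A real model would be evaluated on held-out test data rather than on the rows it was fitted to.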

Regression: predict a continuous value
• Example: predict the sale price of each house
Some techniques:
• Linear Regression / GLM
• Decision Trees
• Support vector regression
• Neural Network
• Ensembles
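To make the regression idea concrete, the sketch below fits a one-variable linear regression by ordinary least squares in plain Python. The house areas and prices are invented and deliberately perfectly linear so the result is easy to check by hand; `fit_line` is an illustrative name, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical houses: floor area (square metres) and sale price (thousands).
areas = [50, 80, 100, 120, 150]
prices = [150, 240, 300, 360, 450]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 110 + intercept  # price estimate for a 110 m^2 house
print(slope, intercept, predicted_price)
```

With real, noisy data the fitted line would not pass through every point, and more than one input variable would usually be needed.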

Predictive Modeling Applications
• Database marketing
• Financial risk management
• Fraud detection
• Pattern detection

Unsupervised Learning: detect natural patterns

Training and test datasets: the same records shown for supervised learning.

Clustering: detect similar instance groupings
• Example: customer segmentation
Some techniques:
• k-means
• Spectral clustering
• DBSCAN
• Hierarchical clustering
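The k-means technique named above can be sketched in a few lines of plain Python. This is a simplified illustration: the customer data is invented, the initial centroids are simply the first k points (real implementations usually pick them at random and restart several times), and `kmeans` is a hypothetical helper name.

```python
def kmeans(points, k, iterations=10):
    """Basic k-means: alternate assignment and centroid update steps."""
    centroids = list(points[:k])  # naive deterministic initialisation
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 10), (2, 12), (1, 14), (9, 80), (10, 85), (11, 90)]
centroids, clusters = kmeans(customers, k=2)
print(centroids)
print(clusters)
```

On this toy data the two segments are occasional low spenders and frequent high spenders.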

Association Rule Discovery

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
• {Milk} --> {Coke}
• {Diaper, Milk} --> {Beer}

Applications: store layout design/promotion
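The strength of the two rules above is usually measured by support and confidence. The sketch below computes both on the five transactions from the slide, in plain Python; the function names `support` and `confidence` are illustrative and are not the arules API.

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


milk_coke_support = support({"Milk", "Coke"}, transactions)
milk_coke_conf = confidence({"Milk"}, {"Coke"}, transactions)
beer_conf = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(milk_coke_support, milk_coke_conf, beer_conf)
```

Here {Milk} --> {Coke} holds in 3 of the 4 transactions that contain Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions that contain both Diaper and Milk.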

Product recommendation: predicting “preference”

Analytics with R

What is R?
• R is a free software environment for statistical computing and graphics.
• R can be easily extended with the 5,800+ packages available on CRAN (as of 13 Sept 2014).
• Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
• R manuals are available on CRAN.

Why R?
• R is widely used in both academia and industry.
• R was ranked no. 1 in the KDnuggets 2014 poll on top languages for analytics, data mining, and data science (it was also no. 1 in 2011, 2012, and 2013).
• The CRAN Task Views provide collections of packages for different tasks.

Classification with R
• Decision trees: rpart, party
• Random forest: randomForest, party
• SVM: e1071, kernlab
• Neural networks: nnet, neuralnet, RSNNS
• Performance evaluation: ROCR

Clustering with R
• k-means: kmeans(), kmeansruns()
• k-medoids: pam(), pamk()
• Hierarchical clustering: hclust(), agnes(), diana()
• DBSCAN: fpc
• BIRCH: birch

Association Rule Mining with R
• Association rules: apriori(), eclat() in package arules
• Sequential patterns: arulesSequence
• Visualization of associations: arulesViz

Text Mining with R
• Text mining: tm
• Topic modelling: topicmodels, lda
• Word cloud: wordcloud
• Twitter data access: twitteR

Time Series Analysis with R
• Time series decomposition: decomp(), decompose(), arima(), stl()
• Time series forecasting: forecast
• Time series clustering: TSclust
• Dynamic Time Warping (DTW): dtw

Social Network Analysis with R
• Packages: igraph, sna
• Centrality measures: degree(), betweenness(), closeness()
• Clustering coefficient: transitivity()
• Clusters: clusters(), no.clusters()
• Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
• Community detection: fastgreedy.community(), spinglass.community()
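Of the centrality measures listed above, degree centrality is the simplest to show by hand. The sketch below is plain Python rather than igraph or sna: it counts each node's neighbours in an undirected edge list and divides by n - 1, one common normalisation. The graph and the name `degree_centrality` are made up for illustration.

```python
def degree_centrality(edges):
    """Normalised degree centrality for an undirected graph given as an edge list."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    n = len(degree)
    # Divide by the largest possible degree, n - 1.
    return {node: d / (n - 1) for node, d in degree.items()}


# Hypothetical friendship network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(edges)
print(centrality)
```

Node A is connected to every other node, so it gets the maximum score of 1.0.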

R and Big Data
• Hadoop
  • Hadoop (or YARN): a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
  • R packages: RHadoop, RHIPE
• Spark
  • Spark: a fast and general engine for large-scale data processing, which can be up to 100 times faster than Hadoop MapReduce
  • SparkR: R frontend for Spark
• H2O
  • H2O: an open-source in-memory prediction engine for big data science
  • R package: h2o
• MongoDB
  • MongoDB: an open-source document database
  • R packages: rmongodb, RMongo
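The "simple programming models" that Hadoop is built around are best known through MapReduce. The sketch below imitates the three MapReduce stages (map, shuffle, reduce) for a word count, in a single Python process; it only illustrates the programming model and does not use Hadoop, RHadoop, or any distributed runtime. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big models need big data"])
print(counts)
```

In a real cluster the map and reduce calls run on many machines in parallel, and the framework performs the shuffle across the network.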

A Framework for Big Data Analytics

Big Data Analytics: Components
• Tool: Hadoop
• R Tools: RHadoop, H2O
• Tool: R

RHadoop

H2O
• Regression
• Classification
• Clustering
• Others: Recommendation, Time Series

Big Data & Analytic Architecture
• Cloudera
• Client access: R Hadoop
• H2O
• Hive: SQL query
• ZooKeeper: coordination, management
• Data processing (batch processing): YARN (MapReduce v2), distributed processing framework
• Resource management: YARN Resource Manager
• Data storage: HDFS (Hadoop Distributed File System)
• YARN enables multiple processing applications

Program List
• Language: Java, R
• Management: Cloudera
• Hadoop Ecosystem: HDFS, YARN, Hive, ZooKeeper
• Analytic: RHadoop, RStudio Server, H2O

Use Case: Predict Airline Delays
• Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travelers and airlines.
• Dataset: airline delay data (1987-2008), 12 GB
  http://stat-computing.org/dataexpo/2009/
• Goal: predict delays (delayTime >= 15 mins) in flights
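The goal above turns a delay measured in minutes into a yes/no target for a classifier. A minimal sketch of that labelling step in plain Python follows. The column names `carrier` and `delay_minutes` and the four sample rows are invented for illustration and are not taken from the actual airline dataset; only the 15-minute threshold comes from the slide.

```python
import csv
import io

DELAY_THRESHOLD_MINUTES = 15  # from the stated goal: delayTime >= 15 mins

# Hypothetical sample in CSV form; the real data has many more columns.
sample = """carrier,delay_minutes
AA,5
UA,15
DL,-3
WN,42
"""


def label_delays(csv_text, threshold=DELAY_THRESHOLD_MINUTES):
    """Return (carrier, is_delayed) pairs, where is_delayed is the binary target."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["carrier"], int(row["delay_minutes"]) >= threshold)
        for row in reader
    ]


labels = label_delays(sample)
delayed_share = sum(delayed for _, delayed in labels) / len(labels)
print(labels, delayed_share)
```

A file of this size (12 GB) would not be read into memory like this in practice; the labelling rule would be applied inside the big data platform described earlier.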

Thank you & Any questions?
