Data Science in Action
Peerapon Vateekul, Ph.D.
Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University

Chula Data Science

Outline
• Data Science & Data Scientist
• Data Mining
• Analytics with R
• A Framework for Big Data Analytics

Data Science & Data Scientist

What is Data Science?
• Data: facts and statistics collected together for reference or analysis
• Science: a systematic study through observation and experiment
• Data Science: the scientific exploration of data to extract meaning or insight, and the construction of software to utilize such insight in a business context

[Figure labels: Data Preparation, Data Analysis, Data Visualization, Data Product]

What is Data Science? (cont.)
• Transform data into valuable insights
• Transform data into data products
• Transform data into interesting stories

What is Data Science? (cont.)
Transform data into valuable insights
• Social Influence in Social Advertising: Evidence from Field Experiments (Bakshy et al. 2012)

What is Data Science? (cont.)
Transform data into data products
• Service Recommendation
• Fraud Detection
• Email Classification (Spam Detection)

What is Data Science? (cont.)
Transform data into interesting stories

Data Science: Famous Definition


Data Science: Components
• Domain Expertise
• Statistics
• Visualization
• Data Engineering
• Advanced Computing

Data Science Process: Iterative Activity

Data Science Tasks

Data Science with Big Data
• Very large raw data sets are now available:
  • Log files
  • Sensor data
  • Sentiment information
• With more raw data, we can build better models with improved predictive performance.
• To handle the larger datasets, we need a scalable processing platform such as Hadoop and YARN.

Who builds these systems?

Data Scientist:

By Thomas H. Davenport and D.J. Patil From the October 2012 issue

It is estimated that by 2018 the US could face a shortage of more than 140,000 people with advanced analytical skills.

Definition
Roles: Business Person, Mathematician, Computer Scientist
Skills:
• Data collection systems
• Statistical models
• Domain expertise
• Machine learning algorithms
• Evaluation metrics
• Knowing what questions to ask
• Predictive analytics
• Interpreting results for business decisions
• Presenting outcomes
• Interface design
• Design/manage/query database
• Data aggregation
• Data mining
• Data visualization

Needed Skills
• Applied Science
  • Statistics, applied math
  • Machine Learning, Data Mining
  • Tools: Python, R, SAS, SPSS
• Business Analysis
  • Data Analysis, BI
  • Business/domain expertise
  • Tools: SQL, Excel, EDW
• Data engineering
  • Database technologies
  • Computer science
  • Tools: Java, Scala, Python, C++
• Big data engineering
  • Big data technologies
  • Statistics and machine learning over large datasets
  • Tools: Hadoop, PIG, HIVE, Cascading, SOLR, etc.

The Data Science Team

Data Mining

What is Data Mining (DM)?
• An automatic process of
  • discovering useful information
  • in large data repositories
  • with sophisticated algorithms

[Figure labels: Statistics, Machine Learning, Database systems, with Data Mining at their intersection]

Data Mining Tasks
• Predictive Task (Supervised Learning)
  • Classification
  • Regression
• Descriptive Task (Unsupervised Learning)
  • Clustering
  • Association Rules Mining
  • Sequence Analysis
• Other
  • Collaborative filtering (recommendation engines): uses techniques from both the supervised and unsupervised worlds.

Supervised Learning: learning from target

Training dataset:
57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0
78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0
84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0
49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0
40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1

Target (one value per training row, in order): 0 1 0 1 1 0 1 0 0 1 0

Test dataset:
71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0

Target: ?

Classification: predicting a category
• Predict targeted customers who tend to buy our product (yes/no)

[Figure axes: Age, Salary]

Some techniques:
• Naïve Bayes
• Decision Tree
• Logistic Regression
• Support Vector Machines
• Neural Network
• Ensembles
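To make the decision-tree idea concrete, here is a minimal sketch in plain Python of a one-level tree (a "decision stump") for the buy/no-buy example. The customer records, the salary units, and the names `fit_stump` and `predict` are all hypothetical and not taken from the slides. The stump only considers rules of the form "feature >= threshold means yes", so it is a simplification of a real tree learner.

```python
# Hypothetical toy data: (age, salary in thousands, bought product? 1 = yes, 0 = no)
customers = [
    (22, 18, 0), (25, 22, 0), (31, 35, 0), (38, 52, 1),
    (45, 61, 1), (52, 70, 1), (29, 48, 1), (60, 30, 0),
]

def fit_stump(rows):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None  # (errors, feature_index, threshold)
    n_features = len(rows[0]) - 1  # last column is the label
    for f in range(n_features):
        for threshold in sorted({r[f] for r in rows}):
            # Rule under test: predict 1 when the feature value is >= threshold.
            errors = sum((r[f] >= threshold) != bool(r[-1]) for r in rows)
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]

def predict(stump, features):
    """Apply the learned rule to one (age, salary) pair."""
    f, threshold = stump
    return 1 if features[f] >= threshold else 0

stump = fit_stump(customers)
print("split on feature", stump[0], "at threshold", stump[1])
print("prediction for age 40, salary 55:", predict(stump, (40, 55)))
```

On this toy data the stump splits on salary, because that feature separates the two classes without error while age does not.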


Regression: predict a continuous value
• Predict a sale price of each house

Some techniques:
• Linear Regression / GLM
• Decision Trees
• Support vector regression
• Neural Network
• Ensembles
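As an illustration of the first technique in the list, the sketch below fits a one-variable linear regression by ordinary least squares in plain Python. The floor areas and prices are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand; `fit_line` is an illustrative name, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical toy data: floor area (square metres) and sale price (thousands).
areas = [50, 80, 100, 120, 150]
prices = [150, 240, 300, 360, 450]

intercept, slope = fit_line(areas, prices)
predicted = intercept + slope * 110  # predicted price for a 110 m^2 house
print(intercept, slope, predicted)
```

Real house-price data would be noisy and would need several features, which is where GLMs, trees, and ensembles come in.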

Predictive Modeling Applications
• Database marketing
• Financial risk management
• Fraud detection
• Pattern detection

Unsupervised Learning: detect natural patterns

Training dataset and test dataset: the same rows as in the supervised learning example.

Clustering: detect similar instance groupings

Some techniques:
• k-means
• Spectral clustering
• DBSCAN
• Hierarchical clustering

Example: Customer Segmentation
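A customer segmentation of this kind can be sketched with a plain-Python k-means. The customer points (visits per month, average spend) are hypothetical, and seeding the centroids with the first k points is a simplifying assumption made here to keep the run deterministic; practical implementations usually use random or k-means++ initialization and scale the features first.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]
centroids, segments = kmeans(customers, k=2)
print(centroids)
print(segments)
```

On this toy data the first three customers end up in one segment and the last three in the other.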

Association Rule Discovery

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Rules Discovered:
• {Milk} --> {Coke}
• {Diaper, Milk} --> {Beer}

Applications: store layout design/promotion
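The two rules can be checked against the five transactions using support and confidence, the standard measures for association rules (the slide itself does not show them). The sketch below is plain Python; the function names are illustrative only.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, giving support 0.60 and confidence 0.75. For {Diaper, Milk} --> {Beer}, the figures are 0.40 and about 0.67.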


Product recommendation: predicting “preference”

Analytics with R

What is R?
• R is a free software environment for statistical computing and graphics.
• R can be easily extended with 5,800+ packages available on CRAN (as of 13 Sept 2014).
• Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.
• R manuals are available on CRAN.

Why R?
• R is widely used in both academia and industry.
• R was ranked no. 1 in the KDnuggets 2014 poll on top languages for analytics, data mining, and data science (and was also no. 1 in 2011, 2012, and 2013).
• The CRAN Task Views provide collections of packages for different tasks.

Classification with R
• Decision trees: rpart, party
• Random forest: randomForest, party
• SVM: e1071, kernlab
• Neural networks: nnet, neuralnet, RSNNS
• Performance evaluation: ROCR

Clustering with R
• k-means: kmeans(), kmeansruns()
• k-medoids: pam(), pamk()
• Hierarchical clustering: hclust(), agnes(), diana()
• DBSCAN: fpc
• BIRCH: birch

Association Rule Mining with R
• Association rules: apriori(), eclat() in package arules
• Sequential patterns: arulesSequences
• Visualization of associations: arulesViz

Text Mining with R
• Text mining: tm
• Topic modelling: topicmodels, lda
• Word cloud: wordcloud
• Twitter data access: twitteR

Time Series Analysis with R
• Time series decomposition: decomp(), decompose(), arima(), stl()
• Time series forecasting: forecast
• Time series clustering: TSclust
• Dynamic Time Warping (DTW): dtw

Social Network Analysis with R
• Packages: igraph, sna
• Centrality measures: degree(), betweenness(), closeness(), transitivity()
• Clusters: clusters(), no.clusters()
• Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
• Community detection: fastgreedy.community(), spinglass.community()

R and Big Data
• Hadoop
  • Hadoop (or YARN): a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
  • R packages: RHadoop, RHIPE
• Spark
  • Spark: a fast and general engine for large-scale data processing, which can be 100 times faster than Hadoop
  • SparkR: R frontend for Spark
• H2O
  • H2O: an open source in-memory prediction engine for big data science
  • R package: h2o
• MongoDB
  • MongoDB: an open-source document database
  • R packages: rmongodb, RMongo
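The "simple programming models" mentioned for Hadoop refer most commonly to MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) on a single machine in plain Python to show the shape of the model. It does not use any Hadoop or RHadoop API, and the input records and function names are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Map step: emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)

# Hypothetical input records; on a cluster these would be split across nodes.
records = ["big data", "data science", "big data analytics"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework handles the shuffle, which is why the programmer only has to supply those two functions.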

A Framework for Big Data Analytics

Big Data Analytics: Components
• Tool: Hadoop
• R Tools: RHadoop, H2O
• Tool: R

RHadoop


H2O
• Regression
• Classification
• Clustering
• Others: Recommendation, Time Series

Big Data & Analytic Architecture

[Architecture diagram; recoverable labels:]
• Cloudera
• Client Access: R Hadoop
• H2O
• Hive: SQL Query
• ZooKeeper: Coordination, Management
• YARN (MapReduce v2): Distributed Processing Framework; Data Processing (Batch Processing)
• YARN Resource Manager: Resource Management
• HDFS (Hadoop Distributed File System): Data Storage

YARN enables multiple processing applications.

Program List
• Language: Java, R
• Management: Cloudera
• Hadoop Ecosystem: HDFS, YARN, Hive, ZooKeeper
• Analytic: RHadoop, RStudio Server, H2O

Use Case: Predict Airline Delays
• Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travelers and airlines.
• Datasets:
  • Airline delay data (1987-2008)
  • http://stat-computing.org/dataexpo/2009/
  • 12 GB
• Goal:
  • Predict delay (delayTime >= 15 mins) in flights
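The goal above turns the prediction into a binary classification problem. A minimal sketch of how the target label could be derived from a delay value is shown below in plain Python. The flight records and the field name `delay_minutes` are hypothetical stand-ins for whatever column holds delayTime in the real dataset, and treating missing values as unlabeled is an assumption of this sketch.

```python
DELAY_THRESHOLD_MINUTES = 15  # from the stated goal: delayTime >= 15 mins

def is_delayed(delay_minutes):
    """Return 1 if the flight counts as delayed, 0 if not, None if unknown."""
    if delay_minutes is None:
        return None  # assumption: a missing delay value leaves the flight unlabeled
    return 1 if delay_minutes >= DELAY_THRESHOLD_MINUTES else 0

# Hypothetical flight records; negative values mean an early arrival.
flights = [
    {"flight": "A1", "delay_minutes": 3},
    {"flight": "B2", "delay_minutes": 15},
    {"flight": "C3", "delay_minutes": 42},
    {"flight": "D4", "delay_minutes": -5},
    {"flight": "E5", "delay_minutes": None},
]

labels = [is_delayed(f["delay_minutes"]) for f in flights]
known = [label for label in labels if label is not None]
delay_rate = sum(known) / len(known)  # share of labeled flights that are delayed
print(labels, delay_rate)
```

With labels in this form, any of the classification techniques listed earlier can be trained on the remaining flight attributes.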

Thank you & Any questions?