Data Science in Action [PDF]


Data Science in Action
Practical use cases that demonstrate how businesses generate value from data
21 March 2014 – SDS2014, Winterthur

© Copyright 2014 Pivotal. All rights reserved.

Introduction to Pivotal Data Labs

Our Team
• High caliber global team of machine learning experts from a wide variety of quantitative backgrounds
• Equally capable in coding & statistics

Our Tools
• Leading edge tools to implement machine learning collaboratively
• Have open-sourced several of our own tools for wide-spread use

Our Methods
• Parallelized a wide variety of machine learning algorithms for optimum performance on the Pivotal platform
• Agile, test-driven, customer focused

Our Process
• Analytical workflow aligned with business needs and optimized for speed
• Supports iterative and collaborative working

Our Experience
• More than 100 customer assignments carried out in the past 18 months
• Ensures quality and best practice in all our assignments

[Figures: analytics pipeline (structured data such as dbSNP and refSeq, unstructured raw sequence and image data, external data sources, data warehouse, analysis, visualization/quality reporting); segmentation heatmap (normalized, k=8) with the segments Married 56−65, Married with Children, Females 55 and Over, Females 20−40, Males 56−65, Males 41−55 or 66 and Over, Males 26−40, and Males Under 25, plotted over member features (number of dependents, gender, age bands, region, urban/rural); Bayes' theorem, $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$]

Pivotal Data Science Team

By Degree
• Bachelor: 4%
• Masters: 32%
• PhD: 64%

By University
• Top 25, Top 50, Other: 18%, 39%, 43% (label order unclear)

By Subject
• Computer Science: 28%
• Physics: 18%
• Engineering: 11%
• Math, Statistics & OR, Other: 25%, 11%, 7% (label order unclear)

What does the Pivotal Data Science team do?

Deliver Data Science Labs
• Deliver value for customers by applying best practice Data Science
• Kick-off use and spread of the Pivotal platform

Engagement & Enablement
• Enable customers to build on and extend Data Science Labs
• Train and enable customers, partners and Pivotal teams
• Recognized as thought leaders

IP Development
• Develop leading approaches
• Develop and parallelize algorithms
• Develop and file patents

Data Scientists as catalysts for data-driven transformation

Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews
• Q&A

What do we mean by Data Science?

[Figure: Value of Analytics ($) versus Complexity, with four levels]
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?

Big Data & Data Science

Decision = Data + Rules
• Data: “Big Data”
• Rules: Data Science

“Big Data”
• Operational Data
• Dark Data
• Commercial & Public Data
• Social Media

Combining data sources: Example

• IPSQ (Quality). Owner: TS Production team. Test flags from production line. 1 year, ~300GB
• APDM. Owner: TS Production team. Full vehicle history including IPST (technical), IPSL (logistics), IPSQ test flags and all test results. 30 years, ~TBs
• FASTA. Owner: Aftersales. Dealership electronic tests; identifies early issues with cars. >25TB
• IQS (Initial Quality Survey from JD Power). Owner: R&D. Survey responses from new owners after 90 days for approx. 1700 vehicles. Few thousand lines, ~MB
• Social Data. Owner: R&D. Pulling 500MB per day from Twitter
• TQP. Owner: Supplier management. PDFs of parts spec sheets. ~500GB

Generating value from data: Car configurator example

Data sources
• Configurations. All car elements: attribute frequencies (colors etc.), attribute combination frequencies
• Customers. For instance: browsing history, usage patterns, demographic insights
• Sales. Ideally: volumes, pricing, by market, linkable to configurations

Engines built by combining these sources
• Basic Recommendation Engine
• Advanced Recommendation Engine
• Yield Optimization Engine
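The slide names attribute frequencies and attribute combination frequencies as the configuration inputs. A minimal sketch of how a basic recommender could use them is shown here; the sample configurations, the attribute naming scheme, and the `recommend` helper are all hypothetical and are not taken from the deck.

```python
from collections import Counter
from itertools import combinations

# Hypothetical configurator sessions; each set is one saved car configuration.
configurations = [
    {"color:black", "wheels:19in", "roof:panorama"},
    {"color:black", "wheels:19in"},
    {"color:red", "wheels:17in"},
    {"color:black", "roof:panorama"},
    {"color:red", "wheels:19in", "roof:panorama"},
]

# Attribute frequencies: how often each single option is chosen.
attribute_counts = Counter(a for config in configurations for a in config)

# Attribute combination frequencies: how often two options appear together.
pair_counts = Counter(
    frozenset(pair)
    for config in configurations
    for pair in combinations(sorted(config), 2)
)


def recommend(selected, top_n=2):
    """Rank options not yet selected by how often they co-occur with the selection."""
    scores = Counter()
    for candidate in attribute_counts:
        if candidate in selected:
            continue
        for chosen in selected:
            scores[candidate] += pair_counts[frozenset((candidate, chosen))]
    # Highest co-occurrence first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [attribute for attribute, score in ranked[:top_n] if score > 0]


suggestions = recommend({"color:black"})
```

Options that never co-occur with the current selection, such as a second color, score zero and are dropped. The advanced engine on the slide would add customer data to this, which the sketch does not attempt.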

The value of data over time

[Figure: Value of Data ($) versus Time (µs, ms, s, hour, day, month, year, yr+), with regions labeled “Fast Data”, “Big Data”, Traditional Systems, and Pivotal Data Science Labs]

Data Science
• The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.
• In order to drive automated low latency actions in response to events of interest

Why do Data Science?

[Figure: Return on Analytics ($) versus Time, with stages labeled Optimization, Expansion, Innovation, and Business Transformation; annotation: “low hanging fruit”]

What transforms businesses today?
• Digitization
• Internet of Things
• Pervasive Computing
• Pervasive Connectivity

Example: major paradigm shifts in automotive

• 1885, Genesis: Not a horse
• 1908, Mass Production: Mass availability. “You can have any color of car, provided it’s black”
• 1950s, Modern Manufacturing: Brand proliferation. “You can have any color”
• 1980s, Platform Strategy: Globalization. “You can have anything anywhere”
• 2020, What’s Next?: Connected, autonomous vehicles

The Connected Car drives innovation

[Figure: connected-car service landscape with the category labels Telematics, Social Media, eCommerce, Entertainment, and Communication. Services shown include stolen vehicle recovery, remote diagnosis, behaviour monitoring, remote activation, driver assistance, floating car data, real-time parking info, eMobility solutions, share my trip, traffic updates, car search, handsfree telephony, music streaming, WiFi hotspot, pay as you drive, online games, payment solutions, web radio, road tolls, parking space reservation, VoD, car sharing, next gen navigation, hybrid navigation, real-time traffic info, map/PoI updates, vehicle tracking, concierge services, fleet management, geo fencing, city toll, rich media comms, and Car2X comms.]

The Practice of Data Science

Traditional Analytics Process

[Figure: data sources (structured data: sensors, flight recordings) feed a data warehouse, which serves scheduling analysis, operations analysis, and visualization/quality reporting. Time-to-Insights labels: sample, in-memory statistics tool, in-memory optimization tool, forecast, solution.]

Augmenting an analytical architecture

Data sources
• Structured Data: sensors, flight recordings
• Unstructured Data: raw sequence data, image data, geolocation data, voice transcription
• External Data Sources: weather data, open gov data

[Figure: the data sources feed a data warehouse with in-database analytics, which serves analysis and visualization/quality reporting.]

Benefits of a new architecture:
• Eliminates data movement
• Enables rapid data re-processing
• Seamless integration of additional external resources into analyses
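To illustrate what "eliminates data movement" means in practice, the sketch below computes the same per-flight average twice: once by pulling every row to the client, and once by letting the database do the aggregation. It uses Python's built-in `sqlite3` purely as a stand-in for an analytical database; the `sensor_readings` table and its values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (flight_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [("F1", 1.0), ("F1", 3.0), ("F2", 10.0), ("F2", 14.0), ("F2", 12.0)],
)

# Traditional approach: move every row to the client, then aggregate there.
rows = conn.execute("SELECT flight_id, value FROM sensor_readings").fetchall()
sums = {}
for flight_id, value in rows:
    total, count = sums.get(flight_id, (0.0, 0))
    sums[flight_id] = (total + value, count + 1)
client_side = {fid: total / count for fid, (total, count) in sums.items()}

# In-database approach: only one aggregated row per flight leaves the database.
in_database = dict(
    conn.execute(
        "SELECT flight_id, AVG(value) FROM sensor_readings GROUP BY flight_id"
    ).fetchall()
)
```

Both approaches give the same answer, but the second transfers two rows instead of five. On the multi-terabyte sources described earlier in the deck, that difference is the point of the architecture.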

Machine Learning and Big Data
Getting the whole picture improves predictive power

• More data from different sources
• Provides a more complete view
• Improves statistics and inference

[Figure: ROC curves (true positive rate versus false positive rate) for models built on successively more sources: Sensor Data; Sensor Data + Flight Records; Sensor Data + Flight Records + Geoloc & Weather; Sensor Data + Flight Records + Geoloc & Weather + Imaging.]
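The chart's axes, true positive rate and false positive rate, are those of a ROC curve. As a rough sketch of how such a curve and its area (AUC) are computed from model scores, the example below compares two sets of scores. The labels and scores are invented toy values chosen only to make the arithmetic easy to follow; they are not results from the presentation.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for a threshold sweep."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Observations with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data: 1 marks an event of interest, 0 a normal observation.
labels = [1, 1, 0, 1, 0, 0]
scores_one_source = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]    # hypothetical model A
scores_more_sources = [0.9, 0.7, 0.6, 0.8, 0.3, 0.2]  # hypothetical model B

auc_one_source = auc(roc_points(scores_one_source, labels))
auc_more_sources = auc(roc_points(scores_more_sources, labels))
```

A curve that bends further toward the top-left corner has a larger area. In this toy case model B ranks every positive above every negative, so its AUC is 1, while model A misorders two of the nine positive/negative pairs.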

The main types of use cases in practice

Data Mining
• Categorize types (segmentation)
• Categorize behaviors/usage
• Identify co-occurrences and associations
• Identify anomalies
• Identify attitudes
• Resolve entities

Predicting Behavior
• Predict churn likelihood
• Predict cross/up sales potential
• Predict fraud/waste/abuse likelihood
• Predict performance
• Predict reliability/quality
• Make a “recommendation”

Optimization
• Optimize processes
• Optimize process parameters
• Optimize asset allocation

PIVOTAL DATA SCIENCE TOOLKIT

1 Find Data
• Platforms: Greenplum DB, Pivotal HD, Hadoop (other), SAS HPA, AWS

2 Run Code
• Interfaces: pgAdminIII, psql, psycopg2, Terminal, Cygwin, PuTTY, WinSCP

3 Write Code
• Editing Tools: Vi/Vim, Emacs, Smultron, TextWrangler, Eclipse, Notepad++, IPython, Sublime
• Languages: SQL, Bash scripting, C, C++, C#, Java, Python, R

4 Write Code for Big Data
• In-Database: SQL, PL/Python, PL/Java, PL/R, PL/pgSQL
• Hadoop: Pig, Hive, Java, HAWQ

5 Implement Algorithms
• Libraries: MADlib
• Java: Mahout
• R: (Too many to list!)
• Text: OpenNLP, NLTK, GPText
• C++: opencv
• Python: numpy, scipy, scikit-learn, Pandas
• Programs: Alpine Miner, RStudio, MATLAB, SAS, Stata

6 Show Results
• Visualization: python-matplotlib, python-networkx, D3.js, Tableau, GraphViz, Gephi, R (ggplot2, lattice, shiny), Excel

7 Collaborate
• Sharing Tools: Chorus, Confluence, Socialcast, GitHub, Google Drive & Hangouts

A large and varied tool box!

As Data Scientists, what do we want?

Infrastructure Independent
• Open source
• PaaS

Fast & Scalable
• In-database analytics embedded into the platform
• MPP

Schema Free
• Hadoop

Real Time
• In-memory data grids

Easy to Use
• SQL, not Java
• Faster than Hive

In-depth Use Case: Traffic Prediction

What does traffic data look like?

…like this?

…or this?

(Note: This is the least offensive topic cluster in our Twitter data!)

Velocity by Time of Day

Distribution of Velocity over Time

Velocity Distribution

[Figure: velocity density for Link 1000064869; x-axis km/h (0 to 200), y-axis density (0.000 to 0.030)]

Find Velocity Groups
• Velocity distributions can be fit well with Gaussians
• An ‘overlay’ of multiple Gaussians is called a Gaussian Mixture Model (GMM)
• Shapes and positions of the Gaussians determine velocity groups
• GMM fitting of the velocity distribution is done by the Expectation-Maximization algorithm

[Figure: velocity density for Link 1000064869 (x-axis km/h, 0 to 200; y-axis density, 0.000 to 0.030) with the combined fit and Components 1 to 4]
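The bullets above describe fitting a Gaussian Mixture Model to a link's velocity distribution with Expectation-Maximization. The following is a minimal one-dimensional sketch of that idea using only the Python standard library. It is not the parallelized implementation used in the project; the synthetic speeds (a slow group around 30 km/h and a fast group around 100 km/h), the quantile-based initialization, and the function names are assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, n_components, n_iter=100, min_std=1e-3):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, stds)."""
    n = len(values)
    ordered = sorted(values)
    # Initialization (an assumption): means at evenly spaced quantiles,
    # a shared standard deviation, and equal weights.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(values) / n
    overall_std = math.sqrt(sum((v - overall_mean) ** 2 for v in values) / n)
    stds = [max(overall_std, min_std)] * n_components
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in values:
            joint = [w * gaussian_pdf(v, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean and std from the responsibilities.
        for k in range(n_components):
            r_k = sum(r[k] for r in resp)
            if r_k <= 0.0:
                continue  # empty component: keep its previous parameters
            weights[k] = r_k / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / r_k
            variance = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / r_k
            stds[k] = max(math.sqrt(variance), min_std)
    return weights, means, stds


# Synthetic velocities in km/h: congested traffic plus free-flowing traffic.
random.seed(0)
speeds = [random.gauss(30, 5) for _ in range(300)] + [random.gauss(100, 10) for _ in range(300)]
weights, means, stds = fit_gmm_1d(speeds, n_components=2)
```

Each fitted component (weight, mean, standard deviation) corresponds to one velocity group. The number of components is fixed by hand here; the slides show fits with three and four components.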

Gaussian Mixture Model

[Figure: four panels for Link 1000064869, each showing velocity density (x-axis km/h, 0 to 200; y-axis density, 0.000 to 0.030) with the combined fit and Components 1 to 4]

Decision Trees
Example

[Figure: velocity density with the combined fit and Components 1 to 3 (x-axis 0 to 250; y-axis density, 0.00 to 0.04), and a decision tree whose nodes are labeled with a component (1, 2, or 3) and a value between 0 and 1. Splits shown: hour >= 14 versus < 14; hour < 20 versus >= 20; weekday = 1,2,3,4,5 versus 6,7; nextlink = −1,100000 versus 100002. Root node: 1, 0.47. Other node labels: 1 0.69, 2 0.56, 1 0.76, 2 0.55, 1 0.81, 2 0.63, 3 0.85, 2 0.73, 3 0.73, 3 0.65, 2 0.68, 3 0.49.]
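The tree on this slide splits on hour and weekday to predict a velocity component. As a rough sketch of how such a tree can be learned, the code below implements a small greedy classifier with Gini impurity and threshold splits. The training rule that generates the labels is entirely hypothetical (weekends map to group 3, weekday daytime to group 1, the rest to group 2) and does not reproduce the tree in the figure; the `nextlink` feature is left out.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] < threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:  # require a real improvement
                best, best_score = (feature, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority group
    feature, threshold = split
    left = [i for i, row in enumerate(rows) if row[feature] < threshold]
    right = [i for i, row in enumerate(rows) if row[feature] >= threshold]
    return (
        feature,
        threshold,
        build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    )


def predict(tree, row):
    # Internal nodes are tuples; leaves are plain integer group labels.
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] < threshold else right
    return tree


def hypothetical_group(hour, weekday):
    """Made-up labeling rule, used only to generate training data."""
    if weekday >= 6:
        return 3
    return 1 if 7 <= hour < 20 else 2


rows = [(hour, weekday) for hour in range(24) for weekday in range(1, 8)]
labels = [hypothetical_group(hour, weekday) for hour, weekday in rows]
tree = build_tree(rows, labels)
```

Because weekdays are encoded 1 to 7, a threshold split such as `weekday < 6` separates working days from the weekend, which mirrors the `weekday = 1,2,3,4,5` versus `6,7` splits shown on the slide.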

Sneak Peek at our TfL Data Demo
• Used the freely accessible TfL data for a demo
• Shows the number of active disruptions over different days in London
   - Rush hour effects are visible
   - Nights are quieter, but there are more disruptions on weekend nights

Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews – not presented
• Q&A

BUILT FOR THE SPEED OF BUSINESS
