Data Science in Action
Practical use cases that demonstrate how businesses generate value from data
21 March 2014 – SDS2014, Winterthur
© Copyright 2014 Pivotal. All rights reserved.
Introduction to Pivotal Data Labs
Our Team
• High caliber global team of machine learning experts from a wide variety of quantitative backgrounds
• Equally capable in coding & statistics
Our Tools
• Leading edge tools to implement machine learning collaboratively
• Have open-sourced several of our own tools for widespread use
Our Methods
• Parallelized a wide variety of machine learning algorithms for optimum performance on the Pivotal platform
• Agile, test-driven, customer focused
Our Process
• Analytical workflow aligned with business needs and optimized for speed
• Supports iterative and collaborative working
Our Experience
• More than 100 customer assignments carried out in the past 18 months
• Ensures quality and best practice in all our assignments
[Slide graphics: an analytics architecture diagram, a cluster chart (normalized, k=8) with segment labels such as Married 56−65, Females 20−40 and Males 26−40, and Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B)]
Pivotal Data Science Team
[Pie charts: By Degree (Bachelor 4%, Masters 32%, PhD 64%); By University (Top 25, Top 50, Other); By Subject (Computer Science 28%, Physics 18%, Engineering 11%, Math, Statistics & OR, Other)]
What does the Pivotal Data Science team do?
Deliver Data Science Labs
• Deliver value for customers by applying best practice Data Science
• Kick-off use and spread of the Pivotal platform
Engagement & Enablement
• Enable customers to build on and extend Data Science Labs
• Train and enable customers, partners and Pivotal teams
• Recognized as thought leaders
IP Development
• Develop leading approaches
• Develop and parallelize algorithms
• Develop and file patents
Data Scientists as catalysts for data-driven transformation
Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews
• Q&A
What do we mean by Data Science?
[Chart: Value of Analytics ($) vs. Complexity]
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
Big Data & Data Science
Decision = Data + Rules “Big Data”
Data Science
“Big Data”
Operational Data
Dark Data
Commercial & Public Data
Social Media
Combining data sources: Example
• IPSQ (Quality). Owner: TS Production team. Test flags from production line. 1 year, ~300 GB
• APDM. Owner: TS Production team. Full vehicle history including IPST (technical), IPSL (logistics), IPSQ test flags and all test results. 30 years, ~TBs
• FASTA. Owner: Aftersales. Dealership electronic tests; identifies early issues with cars. >25 TB
• IQS (Initial Quality Survey from JD Power). Owner: R&D. Survey responses from new owners after 90 days for approx. 1,700 vehicles. A few thousand lines, ~MB
• Social Data. Owner: R&D. Pulling 500 MB per day from Twitter
• TQP. Owner: Supplier management. PDFs of parts spec sheets. ~500 GB
Generating value from data: Car configurator example
[Diagram: three data sources (Configurations, Customers, Sales) combined into a Basic Recommendation Engine, an Advanced Recommendation Engine, and a Yield Optimization Engine]
• Configurations. All car elements: attribute frequencies (colors etc.), attribute combination frequencies
• Customers. For instance: browsing history, usage patterns, demographic insights
• Sales. Ideally: volumes, pricing, by market, linkable to configurations
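The attribute combination frequencies listed for the configurations data suggest how a basic recommendation engine might work. The following is a minimal sketch in Python, not the implementation behind the slide: the sample configurations, the attribute names, and the `recommend` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical saved configurations: each is a set of chosen attributes.
configurations = [
    {"color:black", "engine:diesel", "trim:sport"},
    {"color:black", "engine:diesel", "trim:comfort"},
    {"color:red", "engine:petrol", "trim:sport"},
    {"color:black", "engine:petrol", "trim:sport"},
    {"color:black", "engine:diesel", "trim:sport"},
]

# Attribute frequencies: how often each attribute is chosen.
attribute_counts = Counter(a for config in configurations for a in config)

# Attribute combination frequencies: how often two attributes appear together.
pair_counts = Counter(
    pair
    for config in configurations
    for pair in combinations(sorted(config), 2)
)


def recommend(chosen, top_n=2):
    """Rank other attributes by how often they co-occur with `chosen`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == chosen:
            scores[b] += n
        elif b == chosen:
            scores[a] += n
    return [attr for attr, _ in scores.most_common(top_n)]


print(recommend("engine:diesel"))
```

A real engine would also need to exclude attributes that conflict with the current choice (for example a second engine type); that step is left out here to keep the counting logic visible.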
The value of data over time
[Chart: Value of Data ($) vs. Time (µs, ms, s, hour, day, month, year, yr+), with regions labeled “Fast Data”, “Big Data”, Traditional Systems, and Pivotal Data Science Labs]
Data Science
The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment, in order to drive automated low-latency actions in response to events of interest.
Why do Data Science?
[Chart: Return on Analytics ($) vs. Time, with stages labeled Optimization, Expansion, Innovation, and Business Transformation, and the annotation “low hanging fruit”]
What transforms businesses today?
• Digitization
• Internet of Things
• Pervasive Computing
• Pervasive Connectivity
Example: major paradigm shifts in automotive
• Genesis (1885): Not a horse
• Mass Production (1908): Mass availability. “You can have any color of car, provided it’s black”
• Modern Manufacturing (1950s): Brand proliferation. You can have any color
• Platform Strategy (1980s): Globalization. You can have anything anywhere
• What’s Next? (2020): Connected, autonomous vehicles
The Connected Car drives innovation
[Diagram: connected-car services grouped under Telematics, Navigation, Social Media, eCommerce, Entertainment, and Communication, for example stolen vehicle recovery, remote diagnosis, real-time traffic info, music streaming, WiFi hotspot, pay as you drive, car sharing, city toll, and Car2X comms]
Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews
• Q&A
Traditional Analytics Process
[Diagram: Data Sources (structured data: sensors, flight recordings) → Data Warehouse → Analysis (Scheduling Analysis, Operations Analysis, Visualization/Quality Reporting). A sample is taken into an in-memory statistics tool and an in-memory optimization tool, yielding a forecast and a solution. Annotation: Time-to-Insights]
Augmenting an analytical architecture
[Diagram: Data Sources → Data Warehouse with In-Database Analytics → Analysis and Visualization/Quality Reporting]
Data Sources:
• Structured Data: Sensors, Flight recordings
• Unstructured Data: Raw Sequence Data, Image Data, Geolocation Data, Voice Transcription
• External Data Sources: Weather Data, Open Gov Data
Benefits of a new architecture:
• Eliminates data movement
• Enables rapid data re-processing
• Seamless integration of additional external resources into analyses
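The first benefit, eliminating data movement, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for a database platform; the `sensor_readings` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (sensor_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [(1, 10.0), (1, 14.0), (2, 3.0), (2, 5.0), (2, 7.0)],
)

# Traditional approach: move every row to the client, then aggregate there.
rows = conn.execute("SELECT sensor_id, value FROM sensor_readings").fetchall()
totals = {}
for sensor_id, value in rows:
    count, total = totals.get(sensor_id, (0, 0.0))
    totals[sensor_id] = (count + 1, total + value)
client_side = {sid: total / count for sid, (count, total) in totals.items()}

# In-database approach: aggregate where the data lives; only results move.
in_database = dict(
    conn.execute(
        "SELECT sensor_id, AVG(value) FROM sensor_readings GROUP BY sensor_id"
    ).fetchall()
)

print(client_side, in_database)
```

Both paths give the same averages, but the first transfers every row while the second transfers one row per sensor. The gap grows with table size.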
Machine Learning and Big Data
Getting the whole picture improves predictive power
• More data from different sources
• Provides a more complete view
• Improves statistics and inference
[Chart: ROC curves (true positive rate vs. false positive rate) for models built on Sensor Data; Sensor Data + Flight Records; Sensor Data + Flight Records + Geoloc & Weather; and Sensor Data + Flight Records + Geoloc & Weather + Imaging]
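The comparison on this slide is drawn as true positive rate against false positive rate. As a minimal sketch of how those rates and the area under the resulting curve can be computed, here is a plain-Python version on made-up labels and scores. The two score lists are illustrative assumptions, not results from the talk, and the code assumes all scores are distinct.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points obtained by
    lowering a decision threshold one example at a time.
    Assumes binary labels (1 = positive) and distinct scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


labels = [1, 1, 0, 1, 0, 0]
scores_one_source = [0.9, 0.4, 0.6, 0.7, 0.5, 0.2]    # hypothetical model A
scores_more_sources = [0.9, 0.8, 0.3, 0.7, 0.4, 0.1]  # hypothetical model B

auc_a = auc(roc_points(labels, scores_one_source))
auc_b = auc(roc_points(labels, scores_more_sources))
print(auc_a, auc_b)
```

A curve that sits closer to the top-left corner has a larger area, which is the sense in which one model "improves predictive power" over another.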
The main types of use cases in practice
Data Mining
• Categorize types (segmentation)
• Categorize behaviors/usage
• Identify co-occurrences and associations
• Identify anomalies
• Identify attitudes
• Resolve entities
Predicting Behavior
• Predict churn likelihood
• Predict cross/up sales potential
• Predict fraud/waste/abuse likelihood
• Predict performance
• Predict reliability/quality
• Make a “recommendation”
Optimization
• Optimize processes
• Optimize process parameters
• Optimize asset allocation
PIVOTAL DATA SCIENCE TOOLKIT
1 Find Data
Platforms • Greenplum DB • Pivotal HD • Hadoop (other) • SAS HPA • AWS
2 Run Code
Interfaces • pgAdminIII • psql • psycopg2 • Terminal • Cygwin • Putty • Winscp
3 Write Code
Editing Tools • Vi/Vim • Emacs • Smultron • TextWrangler • Eclipse • Notepad++ • IPython • Sublime
Languages • SQL • Bash scripting • C • C++ • C# • Java • Python • R
4 Write Code for Big Data
In-Database • SQL • PL/Python • PL/Java • PL/R • PL/pgSQL
Hadoop • Pig • Hive • Java • HAWQ
5 Implement Algorithms
Libraries • MADlib
Java • Mahout
R • (Too many to list!)
Text • OpenNLP • NLTK • GPText
C++ • opencv
Python • numpy • scipy • scikit-learn • Pandas
Programs • Alpine Miner • Rstudio • MATLAB • SAS • Stata
6 Show Results
Visualization • python-matplotlib • python-networkx • D3.js • Tableau • GraphViz • Gephi • R (ggplot2, lattice, shiny) • Excel
7 Collaborate
Sharing Tools • Chorus • Confluence • Socialcast • Github • Google Drive & Hangouts
A large and varied tool box!
As Data Scientists, what do we want?
Infrastructure Independent
• Open source
• PaaS
Fast & Scalable
• In-database analytics embedded into the platform
• MPP
Schema Free
• Hadoop
Real Time
• In-memory data grids
Easy to Use
• SQL, not Java
• Faster than Hive
Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews
• Q&A
What does traffic data look like?
…like this?
…or this?
(Note: This is the least offensive topic cluster in our Twitter data!)
Velocity by Time of Day
Distribution of Velocity over Time
Velocity Distribution
[Histogram for Link 1000064869: density (0.000 to 0.030) vs. velocity in km/h (0 to 200)]
Find Velocity Groups
• Velocity distributions can be fit well with Gaussians
• An ‘overlay’ of multiple Gaussians is called a Gaussian Mixture Model (GMM)
• Shapes and positions of the Gaussians determine velocity groups
• GMM fitting of the velocity distribution is done by the Expectation-Maximization algorithm
[Plot for Link 1000064869: density vs. km/h (0 to 200), showing the combined fit and Components 1 to 4]
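To make the Expectation-Maximization step concrete, here is a minimal sketch of fitting a one-dimensional Gaussian mixture in plain Python. It is an illustration under stated assumptions rather than the implementation used for the slides: the velocities are synthetic, only two components are fitted, and the starting means, the initial spread of 20 km/h, and the fixed iteration count are arbitrary choices.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def fit_gmm(data, means, iterations=100):
    """Fit a 1-D Gaussian mixture by EM, starting from initial `means`."""
    k = len(means)
    means = list(means)
    stds = [20.0] * k          # assumed initial spread, in km/h
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # guard against collapse
    return weights, means, stds


# Synthetic velocities (km/h): a congested group and a free-flow group.
rng = random.Random(0)
velocities = ([rng.gauss(30, 8) for _ in range(600)]
              + [rng.gauss(100, 12) for _ in range(400)])

weights, means, stds = fit_gmm(velocities, means=[20.0, 120.0])
print(weights, means, stds)
```

Each fitted component (weight, mean, spread) corresponds to one velocity group in the sense of the slide. Choosing the number of components, four in the plots here, is a separate modelling decision that this sketch does not address.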
Gaussian Mixture Model
[Four panels for Link 1000064869: density (0.000 to 0.030) vs. km/h (0 to 200), each showing the combined fit and Components 1 to 4]
Decision Trees
Example
[Plot: density (0.00 to 0.04) vs. km/h (0 to 250), showing the combined fit and Components 1 to 3]
[Tree diagram: splits on hour (>= 14 vs. < 14; < 20 vs. >= 20), weekday (1,2,3,4,5 vs. 6,7), and nextlink; each node is labeled with a class (1, 2, or 3) and a value, for example 1 0.47, 1 0.81, 2 0.63, and 3 0.85]
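Splits such as "hour >= 14" in a tree like this are usually chosen so that the resulting groups are as pure as possible. The sketch below shows one common way to pick a single split, by minimizing Gini impurity. The observations are hypothetical, and the talk does not state which tree algorithm or split criterion was actually used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Find the (feature, threshold) whose 'value < threshold' split
    gives the lowest size-weighted Gini impurity."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            if not left or not right:
                continue  # a split must put data on both sides
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


# Hypothetical observations: class 1 before 14:00, class 2 from 14:00 on,
# regardless of weekday.
rows = [{"hour": h, "weekday": d}
        for h in (8, 10, 12, 14, 16, 18)
        for d in (1, 3, 6)]
labels = [1 if row["hour"] < 14 else 2 for row in rows]

score, feature, threshold = best_split(rows, labels, ["hour", "weekday"])
print(feature, threshold, score)
```

A full tree applies the same search recursively to each side of the chosen split until the groups are pure enough or too small to split further.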
Sneak Peek at our TfL Data Demo
• Used the freely accessible TfL data for a demo
• Shows # of active disruptions over different days in London
→ Rush hour effects visible
→ Nights are quieter, but there are more disruptions on weekend nights
Data Science in Action
• The Value of Data Science
• The Practice of Data Science
• In-depth Use Case: Traffic Prediction
• Use Case Overviews – not presented
• Q&A
BUILT FOR THE SPEED OF BUSINESS