FROM DESKTOP TO CLOUD TO EMBEDDED GPUS
DESIGNING, TRAINING, AND COMPILING VISION AND DEEP LEARNING ALGORITHMS USING MATLAB
Avinash Nehemiah, Joss Knight, Girish Venkataramani
© 2017 The MathWorks, Inc.
Talk Outline
Design Deep Learning & Vision Algorithms
Highlights
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks
Accelerate and Scale Training
Highlights
• Acceleration with GPUs
• Scale to clusters
High Performance Embedded Implementation
Highlights
• Automate compilation of MATLAB to CUDA
• 14x speedup over Caffe & 4x speedup over TensorFlow
Let’s Use Object Detection as an Example
In our example we’ll use deep learning for object detection.
(Figure: detected vehicles labeled TRUCK, CAR, and SUV)
Two Approaches for Deep Learning
1. Train a Deep Neural Network from Scratch
2. Fine-tune a pre-trained model (transfer learning)
Transfer Learning Workflow
Transfer Learning: Images + Labels → Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier
Reference networks: AlexNet, VGG-16, VGG-19, GoogLeNet
Training data labels: Car, Truck, Large Truck, SUV, Van
Manage Large Sets of Images
Organize Images in Folders (~10,000 images, 5 folders)
imageData = imageDatastore('vehicles')
Easily manage large sets of images
- Single line of code to access images
- Operates on disk, database, big-data file system
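A minimal sketch of the one-line datastore access described on this slide. The temporary folder, subfolder names, and random placeholder images are assumptions made so the snippet is self-contained; the talk's own 'vehicles' dataset is not reproduced here.

```matlab
% Build a tiny stand-in for the 'vehicles' folder: one subfolder per label.
% Folder names and image contents are hypothetical placeholders.
root = fullfile(tempdir, 'vehicles_demo');
labels = {'car', 'truck'};
for k = 1:numel(labels)
    classDir = fullfile(root, labels{k});
    if ~exist(classDir, 'dir')
        mkdir(classDir);
    end
    for n = 1:3
        img = uint8(255 * rand(32, 32, 3));   % random RGB placeholder
        imwrite(img, fullfile(classDir, sprintf('img%d.png', n)));
    end
end

% One line gives access to every image; labels come from the folder names.
imds = imageDatastore(root, 'IncludeSubfolders', true, 'LabelSource', 'foldernames');

counts = countEachLabel(imds);     % table with Label and Count columns
firstImage = readimage(imds, 1);   % images are read from disk on demand
```

Because the datastore only keeps file paths and labels in memory, the same pattern should carry over to folders far larger than this toy one.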
Automate Ground Truth Labeling
Ground Truth Labeling
Access Reference Models in MATLAB
Easily Load Reference Networks
Access models with one line of MATLAB code:
Net1 = alexnet
Net2 = vgg16
Net3 = vgg19
Access Reference Models in MATLAB
1. Reference Models
2. Model Importer
3. Tutorials
Modify Network Structure
Simple MATLAB API to modify layers:
layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier')
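A minimal sketch of the layer swap shown above, using a small hand-built layer array as a stand-in so that no pretrained model download is needed. The stand-in architecture and its layer names are assumptions; in the talk's workflow the array would come from a reference network (for example `net = alexnet; layers = net.Layers;`), where positions 23 and 25 are the last fully connected layer and the classification output.

```matlab
% Stand-in network (hypothetical); a reference network would normally supply this array.
layers = [
    imageInputLayer([32 32 3])
    convolution2dLayer(3, 8)
    reluLayer
    fullyConnectedLayer(1000, 'Name', 'fc_original')
    softmaxLayer
    classificationLayer('Name', 'original_output')
];

numClasses = 5;   % Car, Truck, Large Truck, SUV, Van

% Replace the last fully connected layer and the output layer, as on the slide.
layers(end-2) = fullyConnectedLayer(numClasses, 'Name', 'fc8');
layers(end)   = classificationLayer('Name', 'VehicleClassifier');
```

Indexing with `end-2` and `end` is used here instead of fixed positions so the same two lines apply to any network that ends in a fully connected, softmax, and classification layer.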
Training Object Detectors
Train Any Network
trainNetwork(datastore, layers, options)
Frameworks for Computer Vision
• Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine Learning: ACF, Cascade Object Detectors
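A minimal sketch of a `trainNetwork` call, assuming tiny synthetic data so it runs in seconds on a CPU. The data, the layer sizes, and every training option value are illustrative assumptions, not the settings used in the talk, and the call passes in-memory arrays rather than a datastore.

```matlab
rng(0);                                     % reproducible synthetic data
numImages = 40;
X = rand(28, 28, 1, numImages, 'single');   % height x width x channels x images
Y = categorical([ones(numImages/2, 1); 2 * ones(numImages/2, 1)]);

layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 4)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer
    classificationLayer
];

options = trainingOptions('sgdm', ...
    'MaxEpochs', 1, ...
    'MiniBatchSize', 20, ...
    'ExecutionEnvironment', 'cpu', ...
    'Verbose', false);

net = trainNetwork(X, Y, layers, options);  % learn new weights
predicted = classify(net, X);               % one label per input image
```

With random inputs the predictions carry no meaning; the point is only the shape of the call: data, a layer array, and an options object.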
Visualizing and Debugging Intermediate Results
• Many options for visualizations and debugging
• Examples to get started
(Figure panels: Training Accuracy Visualization, Filters, Layer Activations, Feature Visualization, Deep Dream, Deep Dream Activations, …)
Real World Systems Use More Than Deep Learning
Deep learning vehicle detector performance degraded with environmental effects (fog, etc.)
Fog Removal
Challenge: Deep learning frameworks do not include “classical” computer vision
Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
Talk Outline
Design Deep Learning & Vision Algorithms
Accelerate and Scale Training
High Performance Embedded Implementation
Can you solve “real” problems for production systems with MATLAB?
Doesn’t it take hours or days to train?
Problems of acceleration and scale
How can I make my code run faster?
How can I scale up to bigger problems?
Will I have to learn new tools?
Will I have to learn new concepts?
MATLAB and Parallel Computing
Accelerate your code on the GPU – DEMO: Preprocess your image dataset
Scale to multi-GPU and clusters – DEMO: Parallel training with an Amazon P2 cluster
Transfer Learning with MATLAB
Built-in function support
Parallel Computing: over 300 core MATLAB functions optimized for GPU
• Elementary math
• Linear algebra
• FFTs and IFFTs
• Convolution and filtering
• Fitting and interpolation
• Reductions and sorting
• Sparse matrix support
• double, single and integer support
Neural Networks: Deep Learning, Neural Network training and simulation
Image Processing and Computer Vision: Feature detection, transformations, filtering, object analysis
Signal Processing and Communications: FFT filtering, cross correlation, BER simulations
Statistics and Machine Learning: Distributions, hypothesis testing, kmeans clustering, nearest neighbour
Programming with GPUs
From ease of use to greater control:
• GPU-optimized functions
• Simple programming constructs – gpuArray, gather
• Writing kernels in the MATLAB language – arrayfun
• Interface with your own CUDA C and C++ code – CUDAKernel, mexcuda (prototyping, test framework)
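A minimal sketch of the first three levels listed above: a GPU-optimized built-in, the `gpuArray`/`gather` constructs, and an element-wise kernel via `arrayfun`. The placeholder image, the filter, the contrast-stretch formula, and the CPU fallback are all assumptions added for illustration.

```matlab
img = rand(256, 256, 'single');    % placeholder image data

% Assumption: fall back to plain CPU arrays when no GPU support is found.
hasGpu = exist('gpuDeviceCount', 'file') > 0 && gpuDeviceCount > 0;
if hasGpu
    data = gpuArray(img);          % simple construct: move data to GPU memory
else
    data = img;
end

% GPU-optimized built-in: conv2 runs on the GPU when its input is a gpuArray.
smoothed = conv2(data, ones(5, 'single') / 25, 'same');

% Kernel written in the MATLAB language: arrayfun applies it to every element.
stretched = arrayfun(@(x) min(max(2 * x - 0.5, 0), 1), smoothed);

if hasGpu
    result = gather(stretched);    % copy the result back to host memory
else
    result = stretched;
end
```

The processing lines are identical on CPU and GPU; only the placement of the input data decides where they execute.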
Deep learning on CPU, GPU, multi-GPU and clusters
(Diagram: scaling along two axes, More CPUs and More GPUs)
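One way to target these hardware configurations is the `'ExecutionEnvironment'` training option. The sketch below only constructs the option objects; it assumes, rather than takes from the talk, that this is the switch used in the demos.

```matlab
% Candidate execution environments: single CPU, single GPU,
% all local GPUs, or a parallel pool (local or on a cluster).
environments = {'cpu', 'gpu', 'multi-gpu', 'parallel'};

opts = cell(size(environments));
for k = 1:numel(environments)
    % Only the hardware target changes; the rest of the training setup stays the same.
    opts{k} = trainingOptions('sgdm', 'ExecutionEnvironment', environments{k});
end
```

Each options object would then be passed to `trainNetwork` unchanged alongside the data and layers.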
Transfer learning in 11 lines of MATLAB code
Performance
30-40% reduction in training time each time you double your GPUs
Communicating with the Cloud costs nothing extra
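A worked example of what the 30-40% figure implies if the reduction is read as compounding with each doubling. That reading and the 10-hour single-GPU baseline are assumptions made for illustration only.

```matlab
baselineHours = 10;              % hypothetical single-GPU training time
numGpus = [1 2 4 8];
doublings = log2(numGpus);       % 0, 1, 2, 3 doublings of the GPU count

reductions = [0.30; 0.40];       % lower and upper figure from the slide

% Rows: 30% and 40% reduction. Columns: 1, 2, 4, 8 GPUs.
% Uses implicit expansion (MATLAB R2016b or later).
hours = baselineHours * (1 - reductions) .^ doublings;
speedup = baselineHours ./ hours;
```

Under these assumptions, 8 GPUs bring the 10 hours down to between 2.16 and 3.43 hours, a speedup of roughly 2.9x to 4.6x rather than the ideal 8x.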
Talk Outline
Design Deep Learning & Vision Algorithms
Accelerate and Scale Training
High Performance Embedded Implementation
Can you create a high performance implementation from MATLAB code?
AlexNet inference using the MATLAB solution is
• ~14x faster than pyCaffe and 60% faster than C++ Caffe
• ~4x faster than TensorFlow, with ~3x less memory use
Why? Presenting the MATLAB to CUDA parallelizing compiler
MATLAB → GPU Parallelization
• Kernel creation
• Memory allocation
• Data transfer minimization
Sample Generated CUDA Code
(Figure: MATLAB source code alongside the auto-generated CUDA code)
GPU Parallelization
• Kernel creation
• Memory allocation
• Data transfer minimization
MATLAB to CUDA compiler flow
MATLAB
Front-end
Control-flow graph intermediate representation (CFG-IR)
Library function mapping
• (×) → cuBLAS calls
• (\) → cuSOLVER calls
• fft → cuFFT calls
• nnet → cuDNN calls
Traditional compiler optimizations
Parallel loop creation
• Identify loop nests that will become CUDA kernels
CUDA kernel creation
• Convert loop to CUDA kernel
• Threads/blocks inferred from loop dimensions
CUDA kernel optimizations
• cudaMemcpy minimization: perform use-def analysis, cudaMalloc GPU variables, insert memcpy
• Shared memory synthesis: infer data locality, map to shared memory, synthesize shared memory access
CUDA code emission
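The slide says thread and block counts are inferred from the loop dimensions. A minimal sketch of one common way to do that arithmetic follows; the loop sizes, the block size of 512, and the one-thread-per-iteration mapping are assumptions, not the compiler's documented behavior.

```matlab
loopDims = [100 30];                 % hypothetical trip counts of a 2-level loop nest
numIterations = prod(loopDims);      % assume one GPU thread per loop iteration

threadsPerBlock = 512;               % assumed block size
numBlocks = ceil(numIterations / threadsPerBlock);

% Threads in the last block that fall outside the loop bounds must do nothing.
idleThreads = numBlocks * threadsPerBlock - numIterations;

% Mapping a linear thread index back to loop indices (1-based here).
[rowIdx, colIdx] = ind2sub(loopDims, numIterations);
```

Rounding up with `ceil` is why generated kernels typically carry a bounds check: the grid covers at least as many threads as there are iterations.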
MATLAB to CUDA compiler: It’s all about big parallel loops!
Loop optimizations:
• Scalarization
• Loop perfectization
• Loop interchange
• Loop fusion
• Scalar replacement
Example effect of the loop optimizations:
• Before: 2 kernels (size N), 20*N bytes
• After: 1 kernel (size N), 16*N bytes
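A minimal sketch of loop fusion on a hypothetical pair of loops. The slide's own example code is not in the transcript; this one was chosen so that, assuming 4-byte `single` elements and counting each array read or write as N elements of memory traffic, its totals line up with the 20*N and 16*N figures.

```matlab
N = 1000;
a = rand(N, 1, 'single');

% Unfused: two loops, each of which would become its own kernel of size N.
b1 = zeros(N, 1, 'single');
c1 = zeros(N, 1, 'single');
for i = 1:N
    b1(i) = 2 * a(i);            % reads a, writes b
end
for i = 1:N
    c1(i) = a(i) + b1(i);        % reads a and b, writes c
end

% Fused: one loop, so a(i) is read from memory once per iteration.
b2 = zeros(N, 1, 'single');
c2 = zeros(N, 1, 'single');
for i = 1:N
    ai = a(i);
    b2(i) = 2 * ai;
    c2(i) = ai + b2(i);
end

bytesPerElement = 4;                         % single precision
unfusedBytes = 5 * bytesPerElement * N;      % 3 reads + 2 writes
fusedBytes   = 4 * bytesPerElement * N;      % 2 reads + 2 writes
```

Scalar replacement could go one step further in this sketch by keeping `2 * ai` in a local variable, which would also remove the read of `b2(i)`.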
cudaMemcpy minimization
A(:) = ….
C(:) = ….
for i = 1:N
    gB = kernel1(gA);            ← cudaMemcpy *definitely* needed (A to gA before first use)
    gA = kernel2(gB);            ← cudaMemcpy *not* needed (gB is already on the GPU)
    if (some_condition)
        gC = kernel3(gA, gB);
    end
    ….
end
….
…. = C;                          ← cudaMemcpy *may be* needed (only if kernel3 ran)
Assume gA, gB and gC are mapped to GPU memory
Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy – track memory location with a status flag per variable
• Use-def analysis to determine where to insert memcpy
Generated (pseudo) code
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
    if (A_isDirtyOnCpu)
        cudaMemcpy(gA, A);
        A_isDirtyOnCpu = false;
    end
    gB = kernel1(gA);
    gA = kernel2(gB);
    if (some_condition)
        gC = kernel3(gA, gB);
        C_isDirtyOnGpu = true;
    end
    …
end
…
if (C_isDirtyOnGpu)
    cudaMemcpy(C, gC);
    C_isDirtyOnGpu = false;
end
… = C;
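The dirty-flag strategy can be simulated in plain MATLAB to count how many copies actually happen. In this sketch ordinary arrays stand in for GPU memory, simple arithmetic stands in for the three kernels, and the outcomes of `some_condition` are fixed by hand; all of these are assumptions for illustration.

```matlab
N = 5;
conditionAt = [false false true false true];  % hypothetical outcomes of some_condition

A = (1:4)';                 % host data, just written on the CPU
C = zeros(4, 1);            % host copy of the result
gA = [];                    % stand-ins for GPU-resident variables
gC = zeros(4, 1);
copies = 0;                 % number of simulated cudaMemcpy calls

A_isDirtyOnCpu = true;      % the CPU copy of A is newer than gA
C_isDirtyOnGpu = false;     % the GPU copy of C is not newer than C

for i = 1:N
    if A_isDirtyOnCpu
        gA = A;                     % simulated host-to-device cudaMemcpy
        copies = copies + 1;
        A_isDirtyOnCpu = false;
    end
    gB = 2 * gA;                    % stand-in for kernel1
    gA = gB + 1;                    % stand-in for kernel2
    if conditionAt(i)
        gC = gA + gB;               % stand-in for kernel3
        C_isDirtyOnGpu = true;
    end
end

if C_isDirtyOnGpu
    C = gC;                         % simulated device-to-host cudaMemcpy
    copies = copies + 1;
    C_isDirtyOnGpu = false;
end
```

Only two copies occur for the whole loop: one for A before the first kernel and one for C after the loop, regardless of how many iterations ran.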
Shared memory synthesis for stencil operations
(Figure: a kh × kw convolution kernel slides over the rows × cols input image; each output pixel is a dot product of the kernel with the input window)
For stencil operations, the MATLAB to CUDA compiler automatically
• Infers GPU shared memory
• Automates the collaborative loading into the shared memory block
• Translates accesses from global variables to shared-memory variables
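A minimal sketch of a stencil operation written as an explicit loop nest, the kind of code the bullets above refer to. The image, the kernel values, and the "valid" border handling are assumptions; whether the compiler recognizes this exact form is not stated in the talk.

```matlab
in = magic(8);                       % rows x cols input image (placeholder data)
k = [1 2 1; 0 0 0; -1 -2 -1];        % kh x kw kernel (hypothetical values)
[rows, cols] = size(in);
[kh, kw] = size(k);

out = zeros(rows - kh + 1, cols - kw + 1);
for r = 1:size(out, 1)
    for c = 1:size(out, 2)
        window = in(r:r+kh-1, c:c+kw-1);      % kh x kw neighborhood of the input
        out(r, c) = sum(sum(window .* k));    % dot product with the kernel
    end
end

% Cross-check against the built-in; conv2 flips the kernel, so rotate it first.
reference = conv2(in, rot90(k, 2), 'valid');
```

Each input pixel is read by up to kh*kw neighboring output computations, which is the reuse that staging a tile of the input in shared memory is meant to exploit.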
Example: Compiling fog-rectification algorithm
MATLAB to CUDA Compilation in Computer Vision Applications
• Fog removal
• Stereo disparity
• Distance transform
• SURF feature extraction
• Ray tracing
Deep learning prediction performance: AlexNet
CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz
GPU: Tesla K40c
(Chart: memory usage (GB), shown as CPU resident memory and GPU peak memory (nvidia-smi), versus batch size (1, 16, 32, 64) for MATLAB to CUDA compiler, TensorFlow, C++-Caffe, Py-Caffe, and MATLAB on CPU+GPU)
Deep learning prediction performance: AlexNet
Jetson (Tegra) TX1
(Chart: frame rate (Fps) versus batch size (1, 16, 32, 64, 128) for MATLAB to CUDA compiler and C++-Caffe)
Deep Learning Prediction Performance: VGG-16
CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Tesla K40c
(Chart: frame rate (Fps) versus batch size (1, 16, 32, 64) for MATLAB to CUDA compiler, C++ Caffe, MATLAB running on CPU+GPU, TensorFlow, and py Caffe)
Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler
Applications shown: Vehicle Detection, AlexNet, People detection, Lane detection
Frame rates shown: ~20 Fps (K40c), ~30 Fps (Tegra X1), ~130 Fps (K40c), ~66 Fps (Tegra X1)
Conclusions
Design Deep Learning & Vision Algorithms
• Deep learning design is easy in MATLAB
Accelerate and Scale Training
• Parallel Computing Toolbox
• 7x faster than pyCaffe
• 2x faster than TensorFlow
High Performance Embedded Implementation
• MATLAB to CUDA compiler
• 14x faster than pyCaffe
• 4x faster than TensorFlow
• 1.6x faster than C++ Caffe
What Next?
• Visit our booth, we love to chat: Booth # 804
• Try Deep Learning with MATLAB
• MATLAB to CUDA compiler: Sign up for our beta program
www.mathworks.com/matlab-cuda-beta