FROM DESKTOP TO CLOUD TO EMBEDDED GPUS
DESIGNING, TRAINING, AND COMPILING VISION AND DEEP LEARNING ALGORITHMS USING MATLAB

Avinash Nehemiah, Joss Knight, Girish Venkataramani

© 2017 The MathWorks, Inc.

Talk Outline

Design Deep Learning & Vision Algorithms
Highlights
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks

Accelerate and Scale Training
Highlights
• Acceleration with GPUs
• Scale to clusters

High Performance Embedded Implementation
Highlights
• Automate compilation of MATLAB to CUDA
• 14x speedup over Caffe & 4x speedup over TensorFlow

Let’s Use Object Detection as an Example

In our example we’ll use deep learning for object detection.

[Figure: detected objects labeled TRUCK, CAR, and SUV]

Two Approaches for Deep Learning

1. Train a Deep Neural Network from Scratch

2. Fine-tune a pre-trained model (transfer learning)

Transfer Learning Workflow

Images + Labels → Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier

Reference networks: AlexNet, VGG-16, VGG-19, GoogLeNet

Training data labels: Car, Truck, Large Truck, SUV, Van

Manage Large Sets of Images

Organize images in folders (~10,000 images, 5 folders)

imageData = imageDatastore('vehicles')

Easily manage large sets of images
- Single line of code to access images
- Operates on disk, database, big-data file system
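As a minimal sketch of the one-line image access described above: the block first writes a tiny stand-in dataset to a temporary folder so that it runs on its own. The folder name `vehicles_demo`, the class names, and the random images are assumptions for illustration only; they are not from the talk. The `'IncludeSubfolders'` and `'LabelSource'` options tell `imageDatastore` to take each image's label from its folder name.

```matlab
% Build a tiny stand-in dataset (hypothetical class names) so the sketch is
% self-contained; in practice, point imageDatastore at your own folder.
rootDir = fullfile(tempdir, 'vehicles_demo');
classNames = {'car', 'truck'};
for k = 1:numel(classNames)
    classDir = fullfile(rootDir, classNames{k});
    if ~exist(classDir, 'dir')
        mkdir(classDir);
    end
    for n = 1:2
        img = uint8(255 * rand(32, 32, 3));   % random RGB placeholder image
        imwrite(img, fullfile(classDir, sprintf('img%d.png', n)));
    end
end

% One line gives access to every image; labels come from the folder names.
imds = imageDatastore(rootDir, 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

labelCounts = countEachLabel(imds);   % table with Label and Count columns
firstImage = readimage(imds, 1);      % images are read from disk on demand
disp(labelCounts);
```

Because the datastore only holds file locations and labels, the same pattern should carry over to image sets that are too large to hold in memory.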

Automate Ground Truth Labeling

[Figure: Ground Truth Labeling]

Access Reference Models in MATLAB

Easily load reference networks: access models with one line of MATLAB code

    Net1 = alexnet
    Net2 = vgg16
    Net3 = vgg19

1. Reference Models

2. Model Importer

3. Tutorials

Modify Network Structure

Simple MATLAB API to modify layers:

    layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
    layers(25) = classificationLayer('Name', 'VehicleClassifier');

Training Object Detectors

Train any network:

    trainNetwork(datastore, layers, options)

Frameworks for Computer Vision
• Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine Learning: ACF, Cascade Object Detectors
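The workflow steps (load a reference network, modify its structure, learn new weights) can be sketched end to end as follows. This is an illustration under stated assumptions, not the speakers' code: it assumes the deep learning toolbox and the AlexNet support package are installed, that `imresize` is available, and that a hypothetical folder `vehicles` holds RGB images in one subfolder per class. The learning rate, epoch count, and batch size are placeholder values.

```matlab
% Sketch only. Assumed: AlexNet support package, and a hypothetical folder
% 'vehicles' containing RGB images in one subfolder per class.
imds = imageDatastore('vehicles', 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');
imds.ReadFcn = @(f) imresize(imread(f), [227 227]);   % AlexNet input size
[trainSet, testSet] = splitEachLabel(imds, 0.8, 'randomized');

% Load the reference network and replace its last learnable layer and its
% output layer so that it predicts the new classes.
net = alexnet;
layers = net.Layers;
numClasses = numel(categories(trainSet.Labels));
layers(23) = fullyConnectedLayer(numClasses, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');

% Learn new weights (placeholder hyperparameters).
options = trainingOptions('sgdm', 'InitialLearnRate', 1e-4, ...
    'MaxEpochs', 10, 'MiniBatchSize', 64);
vehicleNet = trainNetwork(trainSet, layers, options);

% Evaluate the new classifier on the held-out images.
predicted = classify(vehicleNet, testSet);
accuracy = mean(predicted == testSet.Labels);
```

A small initial learning rate is a common choice for fine-tuning, since most of the pre-trained weights only need minor adjustment.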

Visualizing and Debugging Intermediate Results

• Many options for visualizations and debugging
• Examples to get started

[Figure panels: Deep Dream, Training Accuracy Visualization, Filters, Layer Activations, Feature Visualization, Deep Dream Activations]

Real World Systems Use More Than Deep Learning

Deep learning vehicle detector performance degraded with environmental effects (fog, etc.)

[Figure: Fog Removal]

Challenge: Deep learning frameworks do not include "classical" computer vision

Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation

Talk Outline

Design Deep Learning & Vision Algorithms

Accelerate and Scale Training

High Performance Embedded Implementation

• Can you solve "real" problems for production systems with MATLAB?
• Doesn't it take hours or days to train?

Problems of acceleration and scale

• How can I make my code run faster?
• How can I scale up to bigger problems?
• Will I have to learn new tools?
• Will I have to learn new concepts?

MATLAB and Parallel Computing

• Accelerate your code on the GPU
  – DEMO: Preprocess your image dataset
• Scale to multi-GPU and clusters
  – DEMO: Parallel training with an Amazon P2 cluster

Transfer Learning with MATLAB

Images + Labels → Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier


Built-in function support

Parallel Computing: over 300 core MATLAB functions optimized for GPU
• Elementary math
• Linear algebra
• FFTs and IFFTs
• Convolution and filtering
• Fitting and interpolation
• Reductions and sorting
• Sparse matrix support
• double, single and integer support

Neural Networks: deep learning, neural network training and simulation

Image Processing and Computer Vision: feature detection, transformations, filtering, object analysis

Signal Processing and Communications: FFT filtering, cross correlation, BER simulations

Statistics and Machine Learning: distributions, hypothesis testing, kmeans clustering, nearest neighbour

Programming with GPUs

• GPU-optimized functions
• Simple programming constructs – gpuArray, gather
• Writing kernels in the MATLAB language – arrayfun
• Interface with your own CUDA C and C++ code – CUDAKernel, mexcuda

[Diagram labels: Ease of Use, Greater Control, Prototyping, Test Framework]
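A minimal sketch of the first three options (a GPU-enabled built-in, `gpuArray`/`gather`, and `arrayfun`), assuming Parallel Computing Toolbox and a supported NVIDIA GPU. The random image, the 5-by-5 averaging filter, and the contrast-stretch constants are illustrative choices, not taken from the talk's demo.

```matlab
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
img = rand(512, 512, 'single');     % stand-in for a grayscale image in [0, 1]
box = ones(5, 'single') / 25;       % 5-by-5 averaging filter

gImg = gpuArray(img);               % copy the data to GPU memory
gSmooth = conv2(gImg, box, 'same'); % GPU-enabled built-in runs on the device

% arrayfun on a gpuArray applies the scalar function element-wise on the GPU.
stretch = @(x) min(max(1.5 * x - 0.25, 0), 1);
gOut = arrayfun(stretch, gSmooth);

out = gather(gOut);                 % copy the result back to host memory

% CPU reference for comparison.
ref = stretch(conv2(img, box, 'same'));
maxDiff = max(abs(out(:) - ref(:)));
fprintf('Largest CPU/GPU difference: %g\n', maxDiff);
```

Keeping the data on the GPU between `gpuArray` and `gather` avoids repeated host-device transfers, which is usually where most of the benefit comes from.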


Deep learning on CPU, GPU, multi-GPU and clusters

[Figure axes: More GPUs, More CPUs]


Transfer learning in 11 lines of MATLAB code


Performance

• 30-40% reduction in training time each time you double your GPUs
• Communicating with the Cloud costs nothing extra
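To see what the first figure implies, here is a short worked calculation. It assumes the quoted reduction compounds uniformly with every doubling, which the slide does not state explicitly. Under that assumption, 8 GPUs (three doublings) take about 34% of the single-GPU time at a 30% reduction and about 22% at a 40% reduction, a speedup of roughly 2.9x to 4.6x.

```matlab
% Assumption: the same fractional reduction applies at every doubling.
reduction = [0.30; 0.40];         % low and high ends of the quoted range
numGpus = [1 2 4 8];
doublings = log2(numGpus);        % 0, 1, 2, 3

% Rows: reduction rate; columns: GPU count (implicit expansion, R2016b+).
relativeTime = (1 - reduction) .^ doublings;
speedup = 1 ./ relativeTime;
efficiency = speedup ./ numGpus;  % fraction of ideal linear scaling

disp(array2table(relativeTime, 'VariableNames', ...
    {'GPUs_1', 'GPUs_2', 'GPUs_4', 'GPUs_8'}, ...
    'RowNames', {'Reduction_30pct', 'Reduction_40pct'}));
```

The `efficiency` values make the trade-off explicit: each doubling shortens training, but by less than the ideal factor of two.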

Talk Outline

Design Deep Learning & Vision Algorithms

Accelerate and Scale Training

High Performance Embedded Implementation

Can you create a high-performance implementation from MATLAB code?

AlexNet inference using the MATLAB solution is
• ~14x faster than pyCaffe and 60% faster than C++ Caffe
• ~4x faster and ~3x less memory use than TensorFlow

Why? Presenting the MATLAB to CUDA parallelizing compiler
• GPU parallelization
• Kernel creation
• Memory allocation
• Data transfer minimization

Sample Generated CUDA Code

[Figure: MATLAB source code alongside auto-generated CUDA code, illustrating GPU parallelization, kernel creation, memory allocation, and data transfer minimization]

MATLAB to CUDA compiler flow

Library function mapping
• matrix multiply (×) → cuBLAS calls
• backslash (\) → cuSolver calls
• fft → cuFFT calls
• nnet → cuDNN calls

Compilation stages, starting from MATLAB code
1. Front-end
2. Control-flow graph intermediate representation (CFG-IR)
3. Traditional compiler optimizations
4. CUDA kernel optimizations
   • Parallel loop creation: identify loop nests that will become CUDA kernels
   • CUDA kernel creation: convert loop to CUDA kernel; threads/blocks inferred from loop dimensions
   • cudaMemcpy minimization: perform use-def analysis; cudaMalloc GPU variables, insert memcpy
   • Shared memory synthesis: infer data locality; map to shared memory; synthesize shared memory access
5. CUDA code emission

MATLAB to CUDA compiler: It's all about big parallel loops!

Traditional compiler optimizations include loop optimizations:
• Scalarization
• Loop perfectization
• Loop interchange
• Loop fusion
• Scalar replacement

Effect shown on the slide: 2 kernels (size N), 20*N bytes → 1 kernel (size N), 16*N bytes
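The code behind the "2 kernels, 20*N bytes" versus "1 kernel, 16*N bytes" figures is not part of this transcript. The sketch below is a hypothetical example chosen so that the byte counts match, assuming 4-byte single-precision elements and counting every array read or write as memory traffic. Each parallel loop stands in for one CUDA kernel; the names `a`, `b`, `t`, and `c` are invented for illustration.

```matlab
% Hypothetical illustration of loop fusion; not the example from the talk.
N = 1000;
bytesPerElement = 4;                       % single precision
a = rand(N, 1, 'single');
b = rand(N, 1, 'single');

% Unfused: two loops, so two kernels. The temporary t is written by the
% first loop and read back by the second.
t1 = zeros(N, 1, 'single');
c1 = zeros(N, 1, 'single');
for i = 1:N
    t1(i) = a(i) * b(i);                   % reads a, b; writes t
end
for i = 1:N
    c1(i) = t1(i) + 1;                     % reads t; writes c
end
bytesUnfused = bytesPerElement * N * (3 + 2);   % 20*N bytes

% Fused: one loop, so one kernel. The product is reused directly, so t is
% never read back from memory.
t2 = zeros(N, 1, 'single');
c2 = zeros(N, 1, 'single');
for i = 1:N
    p = a(i) * b(i);                       % reads a, b
    t2(i) = p;                             % writes t
    c2(i) = p + 1;                         % writes c
end
bytesFused = bytesPerElement * N * 4;           % 16*N bytes

fprintf('Unfused: %d bytes, fused: %d bytes\n', bytesUnfused, bytesFused);
```

If `t` were not needed after the loop, removing its store as well would cut the traffic further, which is the kind of saving scalar replacement targets.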

cudaMemcpy minimization

    A(:) = ….
    C(:) = ….
    for i = 1:N
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
        end
        ….
    end
    ….
    …. = C;

Slide annotations on this code: cudaMemcpy *not* needed; cudaMemcpy *definitely* needed; cudaMemcpy *may be* needed.

Assume gA, gB and gC are mapped to GPU memory.

Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track memory location with a status flag per variable
• Use-def analysis to determine where to insert memcpy

Generated (pseudo) code

    A(:) = …
    A_isDirtyOnCpu = true;
    …
    for i = 1:N
        if (A_isDirtyOnCpu)
            cudaMemcpy(gA, A);
            A_isDirtyOnCpu = false;
        end
        gB = kernel1(gA);
        gA = kernel2(gB);
        if (some_condition)
            gC = kernel3(gA, gB);
            C_isDirtyOnGpu = true;
        end
        …
    end
    …
    if (C_isDirtyOnGpu)
        cudaMemcpy(C, gC);
        C_isDirtyOnGpu = false;
    end
    … = C;
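The status-flag strategy can be simulated on the CPU to count how many transfers it performs. This is a sketch, not compiler output: plain assignments stand in for `cudaMemcpy`, simple arithmetic stands in for the three kernels, and `mod(i, 2) == 0` is an arbitrary stand-in for `some_condition`.

```matlab
% CPU-only simulation of the dirty-flag strategy. Assignments to and from
% the g* variables stand in for cudaMemcpy; the kernels are placeholders.
N = 5;
A = rand(1, 8);
C = zeros(1, 8);
A_isDirtyOnCpu = true;              % A was just written on the host
C_isDirtyOnGpu = false;
hostToDevice = 0;                   % transfer counters
deviceToHost = 0;

for i = 1:N
    if A_isDirtyOnCpu
        gA = A;                     % stands in for cudaMemcpy(gA, A)
        hostToDevice = hostToDevice + 1;
        A_isDirtyOnCpu = false;
    end
    gB = 2 * gA;                    % kernel1 (placeholder)
    gA = gB + 1;                    % kernel2 (placeholder)
    if mod(i, 2) == 0               % stand-in for some_condition
        gC = gA + gB;               % kernel3 (placeholder)
        C_isDirtyOnGpu = true;
    end
end

if C_isDirtyOnGpu
    C = gC;                         % stands in for cudaMemcpy(C, gC)
    deviceToHost = deviceToHost + 1;
    C_isDirtyOnGpu = false;
end
result = sum(C);                    % the host-side use of C

fprintf('Host-to-device copies: %d, device-to-host copies: %d\n', ...
    hostToDevice, deviceToHost);
```

With the flags, `A` is uploaded once rather than on every iteration, and `C` is downloaded only if `kernel3` actually ran.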

Shared memory synthesis for stencil operations

[Figure: a kh × kw convolution kernel slides over the input image (rows × cols); each output pixel is a dot product of the kernel with the image window under it]

For stencil operations, the MATLAB to CUDA compiler automatically
• Infers GPU shared memory
• Automates the collaborative loading into the shared memory block
• Translates access from global variable to shared-memory variable
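For reference, a stencil operation of the kind pictured can be written in MATLAB as a loop nest like the one below. This is only a sketch of the input pattern, with an arbitrary 3-by-3 kernel and a small random image as assumptions; it says nothing about the CUDA code the compiler actually emits. Neighbouring output pixels read overlapping input windows, which is the data reuse that shared memory is meant to exploit.

```matlab
% A plain MATLAB stencil: each output pixel is the dot product of the
% kernel with the kh-by-kw input window beneath it ("valid" region only).
in = rand(6, 8);                        % small stand-in input image
k = [1 0 -1; 2 0 -2; 1 0 -1];           % arbitrary 3-by-3 kernel
[kh, kw] = size(k);
outRows = size(in, 1) - kh + 1;
outCols = size(in, 2) - kw + 1;
out = zeros(outRows, outCols);

for r = 1:outRows
    for c = 1:outCols
        window = in(r:r+kh-1, c:c+kw-1);    % overlaps neighbouring windows
        out(r, c) = sum(sum(window .* k));  % dot product
    end
end

% Cross-check against the built-in: correlation equals convolution with
% the kernel rotated by 180 degrees.
ref = conv2(in, rot90(k, 2), 'valid');
maxErr = max(abs(out(:) - ref(:)));
```

Every iteration of the two outer loops is independent, so the nest is a natural candidate for parallel loop creation.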

Example: Compiling fog-rectification algorithm


MATLAB to CUDA Compilation in Computer Vision Applications

• Fog removal
• Stereo disparity
• Distance transform
• SURF feature extraction
• Ray tracing

Deep learning prediction performance: AlexNet

CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz
GPU: Tesla K40c

[Figure: Memory usage (GB) versus batch size (1, 16, 32, 64), showing CPU resident memory and GPU peak memory (nvidia-smi) for MATLAB to CUDA compiler, TensorFlow, C++-Caffe, Py-Caffe, and MATLAB on CPU+GPU]

Deep learning prediction performance: AlexNet on Jetson (Tegra) TX1

[Figure: Frame rate (Fps) versus batch size (1, 16, 32, 64, 128) for MATLAB to CUDA compiler and C++-Caffe]

Deep Learning Prediction Performance: VGG-16

CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
GPU: Tesla K40c

[Figure: Frame rate (Fps) versus batch size (1, 16, 32, 64) for MATLAB to CUDA compiler, C++ Caffe, MATLAB running on CPU+GPU, TensorFlow, and py Caffe]

Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler

Applications shown: Vehicle detection, AlexNet, People detection, Lane detection

Frame rates shown: ~20 Fps (K40c), ~30 Fps (Tegra X1), ~130 Fps (K40c), ~66 Fps (Tegra X1)

Conclusions

Design Deep Learning & Vision Algorithms
• Deep learning design is easy in MATLAB

Accelerate and Scale Training
• Parallel Computing Toolbox
• 7x faster than pyCaffe
• 2x faster than TensorFlow

High Performance Embedded Implementation
• MATLAB to CUDA compiler
• 14x faster than pyCaffe
• 4x faster than TensorFlow
• 1.6x faster than C++ Caffe

What Next?

• Visit our booth, we love to chat: Booth #804
• Try Deep Learning with MATLAB
• MATLAB to CUDA compiler: sign up for our beta program at www.mathworks.com/matlab-cuda-beta
