Convolutional Neural Networks

Discrete filtering in 2D: Convolution

PARRSLAB

2-D Convolution
• Convolution = spatial filtering
• Same equation, one more index
  – now the filter is a rectangle you slide around over a grid of numbers
• Different filters (weights) reveal different characteristics of the input
• Usefulness of associativity
  – often several filters are applied one after another: (((a * b1) * b2) * b3)
  – this is equivalent to applying one filter: a * (b1 * b2 * b3)

What does this convolution kernel do?

  1/8 ×  0  1  0
         1  4  1
         0  1  0

2

Discrete filtering in 2D: Convolution

PARRSLAB

2-D Convolution
• Convolution = spatial filtering
• Same equation, one more index
  – now the filter is a rectangle you slide around over a grid of numbers
• Different filters (weights) reveal different characteristics of the input
• Usefulness of associativity
  – often several filters are applied one after another: (((a * b1) * b2) * b3)
  – this is equivalent to applying one filter: a * (b1 * b2 * b3)

What does this convolution kernel do?

   0  -1   0
  -1   4  -1
   0  -1   0

3

Discrete filtering in 2D: Convolution

PARRSLAB

2-D Convolution
• Convolution = spatial filtering
• Same equation, one more index
  – now the filter is a rectangle you slide around over a grid of numbers
• Different filters (weights) reveal different characteristics of the input
• Usefulness of associativity
  – often several filters are applied one after another: (((a * b1) * b2) * b3)
  – this is equivalent to applying one filter: a * (b1 * b2 * b3)

What does this convolution kernel do?

  1   0  -1
  2   0  -2
  1   0  -1

4

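The following is a minimal NumPy sketch, added for illustration and not part of the original slides, of discrete 2-D convolution with the three kernels shown above (the descriptive kernel names are mine, not from the slides):

import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: flip the kernel and slide it over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    flipped = kernel[::-1, ::-1]   # convolution (unlike correlation) flips the kernel
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

smooth    = np.array([[0, 1, 0], [1, 4, 1], [0, 1, 0]]) / 8.0   # weighted local average
laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]])     # second-derivative (edge) filter
sobel_x   = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])      # horizontal-gradient filter

image = np.random.rand(32, 32)
print(conv2d(image, smooth).shape)   # (30, 30): a 3x3 kernel trims one pixel from each side

Because convolution is associative, applying b1, b2, b3 in sequence is the same as applying the single kernel b1 * b2 * b3 once, as the slides note.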
PARRSLAB

CNNs - A review
• A neural network model that consists of a sequence of local & translation invariant layers
  – Many identical copies of the same neuron: weight/parameter sharing
  – Hierarchical feature learning

[AlexNet diagram: convolutional layers c1–c5 followed by fully connected layers f6–f8, with weights w1–w8, mapping the image to a class prediction ("bike")]

Image credit: Andrea Vedaldi

5

PARRSLAB

CNNs - A bit of history
• Neocognitron model by Fukushima (1980)
  – the first convolutional neural network (CNN) model
  – so-called “sandwich” architecture
    − simple cells act like filters
    − complex cells perform pooling

• Difficult to train − No backpropagation yet

6

PARRSLAB

CNNs - A bit of history
• Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]
• LeNet-5 model

[LeNet-5 diagram: INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: f. maps 6@14x14 → convolutions → C3: f. maps 16@10x10 → subsampling → S4: f. maps 16@5x5 → full connection → C5: layer 120 → full connection → F6: layer 84 → Gaussian connections → OUTPUT 10]

PARRSLAB

CNNs - A bit of history
• A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
• AlexNet model

8

PARRSLAB

Convolutional layer

Example: convolution layer
• Learn a filter bank (a set of filters) once
• Use them over the input data to extract features

[Figure: input data x * filter bank F → output data y]

Image credit: Andrea Vedaldi

9

PARRSLAB

Data = 3D tensors
• There is a vector of feature channels (e.g. RGB) at each spatial location (pixel)

[Figure: an H⨉W image with channels c = 1, 2, 3 stacked into a single H⨉W⨉C 3D tensor]
Slide credit: Andrea Vedaldi

10

PARRSLAB

Convolution with 3D filters
• Each filter F acts on multiple input channels
• Local: filters look locally
• Translation invariant: filters act the same everywhere

[Figure: input x convolved with filter F (summing over channels, Σ) → output y]

Slide credit: Andrea Vedaldi

11

PARRSLAB

Convolutional Layer
• 32x32x3 input, 5x5x3 filter
• Convolve the filter with the input, i.e. “slide over the image spatially, computing dot products”

Slide credit: Andrej Karpathy

12

PARRSLAB

Convolutional Layer
• 32x32x3 input, 5x5x3 filter
• 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the input (i.e. 5*5*3 = 75-dimensional dot product + bias)

Slide credit: Andrej Karpathy

13

PARRSLAB

Convolutional Layer
• 32x32x3 input, 5x5x3 filter
• Convolve (slide) over all spatial locations to produce a 28x28x1 activation map

Slide credit: Andrej Karpathy

14

PARRSLAB

Convolutional Layer
• Consider a second, green filter
• 32x32x3 input, 5x5x3 filter
• Convolve (slide) over all spatial locations to get a second 28x28x1 activation map

Slide credit: Andrej Karpathy

15

PARRSLAB

Convolutional Layer
• Multiple filters produce multiple output channels
• For example, if we had 6 5x5 filters, we’ll get 6 separate 28x28 activation maps
• We stack these up to get an output of size 28x28x6

Slide credit: Andrej Karpathy

16
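Below is a minimal NumPy sketch, illustrative rather than from the slides, of the convolutional layer just described: six 5x5x3 filters are slid over a 32x32x3 input, each position giving a 75-dimensional dot product plus a bias, and the six activation maps are stacked into a 28x28x6 output.

import numpy as np

def conv_layer(x, filters, biases):
    """x: (H, W, C) input; filters: (K, FH, FW, C); biases: (K,) -> output (H', W', K)."""
    H, W, C = x.shape
    K, FH, FW, _ = filters.shape
    out = np.zeros((H - FH + 1, W - FW + 1, K))
    for k in range(K):                                  # one activation map per filter
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = x[i:i + FH, j:j + FW, :]        # small 5x5x3 chunk of the input
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]   # dot product + bias
    return out

x = np.random.rand(32, 32, 3)
filters = np.random.randn(6, 5, 5, 3) * 0.01
biases = np.zeros(6)
print(conv_layer(x, filters, biases).shape)             # (28, 28, 6): six stacked activation maps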

PARRSLAB

Linear / non-linear chains
• The basic blueprint of most architectures: the sandwich architecture
• Stack multiple layers of convolutions, interleaved with ReLU non-linearities and downsampling

[Figure: x → filtering → ReLU → filtering & downsampling → ReLU → … → y]

Slide credit: Andrea Vedaldi

17
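As a rough illustration (assuming PyTorch, which the slides do not mention), the filtering → ReLU → filtering & downsampling chain above could be written as:

import torch
import torch.nn as nn

sandwich = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5),    # filtering with a learned filter bank
    nn.ReLU(),                          # pointwise non-linearity
    nn.Conv2d(16, 32, kernel_size=5),   # more filtering
    nn.MaxPool2d(2),                    # downsampling
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)           # a batch containing one RGB image
print(sandwich(x).shape)                # torch.Size([1, 32, 12, 12])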

PARRSLAB

Convolutional layers
• Local receptive field
• Each column of hidden units looks at a different input patch

[Figure: input image → features; each feature component looks at a local receptive field in the image]

Slide credit: Andrea Vedaldi

18

PARRSLAB

Feature Learning
• The hierarchical layer structure makes it possible to learn hierarchical filters (features).

Slide credit: Andrej Karpathy

19

PARRSLAB

Feature Learning
• The hierarchical layer structure makes it possible to learn hierarchical filters (features).

Slide credit: Yann LeCun

20

PARRSLAB

Pooling layer
• makes the representations smaller and more manageable
• operates over each activation map independently
• max pooling, average pooling, etc.

Single depth slice x:

  1  1  2  4
  5  6  7  8
  3  2  1  0
  1  2  3  4

max pool with 2x2 filters and stride 2 → y:

  6  8
  3  4

Slide credit: Andrej Karpathy

21
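A minimal NumPy sketch, illustrative and not from the slides, of the 2x2, stride-2 max pooling example above:

import numpy as np

def max_pool(x, size=2, stride=2):
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()   # max over each 2x2 window
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]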

PARRSLAB

Fully connected layer • contains neurons that connect to the entire input volume, as in ordinary Neural Networks

Slide credit: Andrej Karpathy

22

PARRSLAB

Fully connected layers
• Global receptive field
• Each hidden unit looks at the entire image

[Figure: a stack of fully-connected layers producing class predictions]

Slide credit: Andrea Vedaldi

23

PARRSLAB

Convolutional vs Fully connected
• Convolutional layers: responses are spatially selective, can be used to localize things
• Fully connected layers: responses are global, do not characterize position well
• Which one is more useful for pixel-level labelling?

Slide credit: Andrea Vedaldi

24

PARRSLAB

Fully-connected layer = large filter
• A fully connected layer can be interpreted as a very large filter that spans the whole input data

[Figure: K filters F(k) of size W⨉H⨉C applied to a W⨉H⨉C input give a 1⨉1⨉K output, one value w(k) per filter]

Slide credit: Andrea Vedaldi

25
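A small NumPy check, added for illustration and not part of the slides, that a fully connected layer with K outputs computes the same thing as a convolution with K filters spanning the whole W⨉H⨉C input:

import numpy as np

H, W, C, K = 4, 4, 3, 10
x = np.random.rand(H, W, C)

fc_weights = np.random.randn(K, H * W * C)        # ordinary fully connected weight matrix
fc_out = fc_weights @ x.reshape(-1)               # K-dimensional output

conv_filters = fc_weights.reshape(K, H, W, C)     # the same weights viewed as K full-size filters F(k)
conv_out = np.array([np.sum(x * conv_filters[k]) for k in range(K)])   # 1x1xK output

print(np.allclose(fc_out, conv_out))              # True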

PARRSLAB

Fully-convolutional neural networks
• Proposed for pixel-level labeling (e.g. semantic segmentation)

[Figure: a fully-convolutional network producing a spatial map of class predictions]

Slide credit: Andrea Vedaldi

26

PARRSLAB

CNN Demo • ConvNetJS demo: training on CIFAR-10 • http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

27

PARRSLAB

CNNs - Years of progress • From LeNet (1998) to ResNet (2015)

28

PARRSLAB

How deep is enough?

LeNet (1998): 2 convolutional layers, 2 fully connected layers

[LeNet diagram omitted: Convolution 5x5 (20 maps) → tanh → max-pool 2x2/2 → Convolution 5x5 (50 maps) → tanh → max-pool 2x2/2 → Flatten → FullyConnected 500 → tanh → FullyConnected 10 → Softmax]

29

PARRSLAB

How deep is enough?

LeNet (1998): 2 convolutional layers, 2 fully connected layers
AlexNet (2012): 5 convolutional layers, 3 fully connected layers

[Diagrams omitted: LeNet as on the previous slide; AlexNet = Convolution 11x11/4 (96) → ReLU → max-pool → LRN → Convolution 5x5 (256) → ReLU → max-pool → LRN → Convolution 3x3 (384) → ReLU → Convolution 3x3 (384) → ReLU → Convolution 3x3 (256) → ReLU → max-pool → Flatten → FullyConnected 4096 → ReLU → Dropout → FullyConnected 4096 → ReLU → Dropout → FullyConnected → Softmax]

30

PARRSLAB

How deep is enough?

LeNet (1998): 2 convolutional layers, 2 fully connected layers
AlexNet (2012): 5 convolutional layers, 3 fully connected layers
VGGNet-M (2013)

[Layer-by-layer network diagrams omitted]

31

PARRSLAB

How deep is enough?

LeNet (1998)
AlexNet (2012)
VGGNet-M (2013)
GoogLeNet (2014): much deeper, built from many stacked "Inception" blocks, each concatenating parallel 1x1, 3x3 and 5x5 convolutions and pooling

[Layer-by-layer network diagrams omitted]

32

How deep is enough?

PARRSLAB

• AlexNet (2012), VGG-M (2013)
• VGG-VD-16 (2014): 16 convolutional layers
• GoogLeNet (2014)
• ResNet 50 (2015): 50 convolutional layers
• ResNet 152 (2015): 152 convolutional layers

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

Slide credit: Andrea Vedaldi

33

Accuracy

PARRSLAB

• 3 ⨉ more accurate in 3 years

[Bar chart omitted: ImageNet top-5 error for caffe-alex, vgg-f, vgg-m, vgg-verydeep-16, googlenet-dag, resnet-50-dag, resnet-101-dag and resnet-152-dag; lower error = more accurate]

Slide credit: Andrea Vedaldi

34

Speed

PARRSLAB

• 5 ⨉ slower

[Bar chart omitted: speed (images/s on a Titan X) for the same models]

Remark: 101 ResNet layers same size/speed as 16 VGG-VD layers
Reason: far fewer feature channels (quadratic speed/space gain)
Moral: optimize your architecture

Slide credit: Andrea Vedaldi

35

PARRSLAB

Model size
• Num. of parameters is about the same

[Bar chart omitted: model size (MBs) for the same models]

Remark: 101 ResNet layers same size/speed as 16 VGG-VD layers
Reason: far fewer feature channels (quadratic speed/space gain)
Moral: optimize your architecture

36

PARRSLAB

Beyond CNNs
• Do features extracted from a CNN generalize to other tasks and datasets?
  − Donahue et al. (2013), Chatfield et al. (2014), Razavian et al. (2014), Yosinski et al. (2014), etc.

• CNN activations as deep features • Finetuning CNNs

37

PARRSLAB

CNN activations as deep features
• CNNs discover effective representations. Why not use them?

Slide credit: Jason Yosinski

38

PARRSLAB

CNN activations as deep features
• CNNs discover effective representations. Why not use them?

Slide credit: Jason Yosinski

39

PARRSLAB

CNN activations as deep features
• CNNs discover effective representations. Why not use them?

[Feature visualizations: Layer 2 and Layer 5 (Zeiler et al., 2014)]

Slide credit: Jason Yosinski

40

PARRSLAB

CNN activations as deep features
• CNNs discover effective representations. Why not use them?

[Feature visualizations: Layer 2 and Layer 5 (Zeiler et al., 2014); Last Layer (Nguyen et al., 2014)]

Slide credit: Jason Yosinski

41

PARRSLAB

CNNs as deep features
• CNNs discover effective representations. Why not use them?

[Figure 1 from DeCAF: t-SNE feature visualizations on the ILSVRC-2012 validation set for (a) LLC, (b) GIST, and features derived from the CNN: (c) DeCAF1, the first pooling layer, and (d) DeCAF6, the second-to-last hidden layer]

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, [Donahue et al.,’14]

42
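A minimal sketch (assuming PyTorch/torchvision, which are not mentioned in the slides or the DeCAF paper) of using a pretrained CNN's penultimate-layer activations as generic features:

import torch
import torchvision.models as models

alexnet = models.alexnet(pretrained=True)          # ImageNet-trained model
alexnet.eval()

# Drop the final classification layer; keep everything up to the 4096-d penultimate output.
feature_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)                # stand-in for a preprocessed input image
    feats = feature_extractor(x)
print(feats.shape)                                 # torch.Size([1, 4096])

These fixed feature vectors can then be fed to any off-the-shelf classifier (e.g. a linear SVM), which is essentially the DeCAF recipe.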

Stability: Transfer learning

PARRSLAB

Transfer Learning with CNNs
• A CNN trained on a (large enough) dataset generalizes to other visual tasks

[Figure 4: t-SNE map of 20,000 Flickr test images based on features extracted from the last layer of an AlexNet; the inset shows a cluster of sports images. A full-resolution map is presented in the supplemental material.]

Learning visual features from Large Weakly supervised Data, [Joulin et al.,’15]

Slide credit: Joan Bruna

43

PARRSLAB

Transfer Learning with CNNs
• Keep layers 1-7 of our ImageNet-trained model fixed
• Train a new softmax classifier on top using the training images of the new dataset

1. Train on ImageNet
2. Small dataset: use the CNN as a fixed feature extractor (freeze the pretrained layers, train only the new top layer)
3. Medium dataset: finetuning; more data = retrain more of the network (or all of it)
   tip: use only ~1/10th of the original learning rate when finetuning the top layer, and ~1/100th on intermediate layers

Slide credit: Andrej Karpathy

44
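A minimal PyTorch sketch (assumed APIs, not from the slides) of the small-dataset recipe above: freeze the ImageNet-trained layers and train only a new classifier on the new dataset. The value of num_classes is hypothetical.

import torch
import torchvision.models as models

model = models.alexnet(pretrained=True)

# 1. Freeze the pretrained layers ("freeze these").
for p in model.parameters():
    p.requires_grad = False

# 2. Replace the final layer with a fresh classifier for the new task ("train this").
num_classes = 20                                   # hypothetical number of classes in the new dataset
model.classifier[-1] = torch.nn.Linear(4096, num_classes)

# 3. Optimize only the new layer. For finetuning on a medium-sized dataset one would instead
#    unfreeze more layers and use ~1/10th (top) to ~1/100th (intermediate) of the original learning rate.
optimizer = torch.optim.SGD(model.classifier[-1].parameters(), lr=1e-3, momentum=0.9)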

PARRSLAB

CNNs in Computer Vision

Classification (Krizhevsky et al., 2012)  |  Object detection (Ren et al., 2015)
*the original image is from the COCO dataset

[Figure omitted: (left) eight ILSVRC-2010 test images with the five labels considered most probable by the model, the correct label written under each image; (right) object detections on a COCO image]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

45

PARRSLAB

CNNs in Computer Vision

Semantic Segmentation (Noh et al., 2015)  |  Multi-Instance Segmentation (He and Gould, 2014)

[Figures omitted: example results (input image, ground truth, FCN, EDeconvNet+CRF) and accumulative precision/recall curves comparing E-SVM, HV+GC and the authors' method on the Polo and TUD pedestrian datasets]

46

PARRSLAB

CNNs in Computer Vision

Face recognition (Taigman et al., 2014)  |  Pose estimation (Toshev and Szegedy, 2014)

[Figures omitted: the DeepFace alignment pipeline (detected face, fiducial points, 2D/3D alignment, frontalized crop) and architecture (a convolution-pooling-convolution front-end followed by locally-connected and fully-connected layers, more than 120 million parameters), and example pose-estimation results on LSP drawn as stick figures]

47

PARRSLAB

CNNs in Computer Vision

Text detection and retrieval (Jaderberg et al., 2016)
"Reading Text in the Wild with Convolutional Neural Networks"

[Figures and tables omitted: text-spotting results on SVT-50 and IC11 with precision/recall/F-measure per image, mean average precision (mAP) for text-based retrieval, and top retrieval results for queries such as "hollywood" and "boris johnson" (P@100: 100%) on a BBC News dataset of 5k hours of video]

48

PARRSLAB

CNNs in Computer Vision

Image Captioning (Karpathy and Fei-Fei, 2015)  |  Visual Question Answering (Antol et al., 2015)

[Figures omitted: image-captioning examples, and VQA examples of free-form, open-ended questions collected via Amazon Mechanical Turk ("What color are her eyes?", "How many slices of pizza are there?", "Is this a vegetarian pizza?"); answering them requires commonsense knowledge along with a visual understanding of the scene]

49
