Generating Training Data for Deep Neural Networks by exploiting LIDAR, Cameras and Maps
Andreas Geiger
Autonomous Vision Group, MPI for Intelligent Systems, Tübingen
Computer Vision and Geometry Group, ETH Zürich
October 23, 2017
Types of Machine Learning

Supervised Learning ("Predictive")
- Learn a mapping from inputs x to outputs y from labeled data D = {(x_i, y_i)}_{i=1}^N
- Examples: image classification, speech recognition, ...

Unsupervised Learning ("Descriptive")
- Find interesting patterns in a dataset D = {x_i}_{i=1}^N
- Examples: image segmentation, dimensionality reduction, ...

Reinforcement Learning
- Find suitable actions to maximize reward, discovered via trial & error
- Examples: robotic systems, AlphaGo, ...
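As a minimal illustration of the supervised setting, one can fit a linear mapping from inputs x to outputs y on a labeled dataset D = {(x_i, y_i)}; the toy data and model below are our own example, not from the talk:

```python
import numpy as np

# Labeled dataset D = {(x_i, y_i)}: inputs x and targets y (here y = 2x + 1 + noise)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=(100, 1))

# Supervised learning: fit the mapping x -> y by least squares
X = np.hstack([x, np.ones_like(x)])           # add a bias column
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
slope, bias = theta.ravel()                   # recovered parameters, close to (2, 1)
```

The same "learn a mapping from labeled pairs" template underlies the deep networks discussed later; only the model class and the optimizer change.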
The Deep Learning Revolution
What is the problem?
Data, labels, and money: lots of each.
Image Classification
[Deng et al., CVPR 2009]
Semantic Segmentation
[Cordts et al., CVPR 2016]
Amazon Mechanical Turk
Data Annotation Industry
What about other tasks?
Stereo
Optical Flow
J. Gibson, 1950
MPI Sintel, 2012
Stereoautograph
www.wild-heerbrugg.com/photogrammetry1.htm
The Supervision Dilemma

Unsupervised Learning
- Great promise, but does not work (yet)

Supervised Learning
- Data annotation is very labor-intensive
- Consistency between annotators is difficult to achieve
- Manual annotation is sometimes virtually impossible (e.g., optical flow)
What can we do?
Synthetic Data
[Gaidon et al., CVPR 2016]
Self-Supervised Learning
[Doersch et al., ICCV 2015]
[Zhou et al., CVPR 2017]
What else?
Supervision Transfer

Idea
- Solve an easier problem and transfer the labels
- Requires algorithms which automate this transfer
- Requires additional sensor modalities

Examples
- Domain transfer (e.g., 3D to 2D)
- Resolution transfer (e.g., high resolution to low resolution)
- Content transfer (e.g., 3D models to real-world scenes)
Semantic Instance Annotation of Street Scenes by 3D to 2D Label Transfer [Xie, Kiefel, Sun & Geiger, CVPR 2016]
The Computer Vision Revolution
[Long et al., 2015]
3D to 2D Semantic and Instance Label Transfer

[Figure: side view, front view, and 3D view; geometric cues and labeled 3D points are combined to transfer labels from 3D to 2D, yielding a dense segmentation]
3D to 2D Semantic and Instance Label Transfer

Advantages over 2D annotation:
- Object instances can be separated more easily in 3D
- A single annotated 3D object projects into many frames
- 2D annotations are temporally coherent

Challenges:
- 3D data is sparse, noisy and incomplete
- 3D annotations are coarse
- Dynamic objects
3D Annotation: Static Scene Elements
- Map prior for localizing buildings

3D Annotation: Detecting Dynamic Objects

3D Annotation: Annotating Dynamic Objects

2D Annotation: Scribbling the Rest
Model

Variables:
- Pixels: {s_i}_{i∈P}
- 3D points: {s_l}_{l∈L}
- Scribbled pixels: {s_j}_{j∈P'}

[Figure: 3D points, image pixels, and scribbled pixels]
Gibbs Energy:
$$E(s) = \sum_{i \in P} \varphi^P_i(s_i) + \sum_{l \in L} \varphi^L_l(s_l) + \sum_{j \in P'} \varphi^{P'}_j(s_j) + \sum_{m \in F} \sum_{i \in P} \varphi^F_{mi}(s_i)$$
$$\qquad + \sum_{i,j \in P} \psi^{P,P}_{ij}(s_i,s_j) + \sum_{l,k \in L} \psi^{L,L}_{lk}(s_l,s_k) + \sum_{i \in P,\, l \in L} \psi^{P,L}_{il}(s_i,s_l) + \sum_{i \in P,\, j \in P'} \psi^{P,P'}_{ij}(s_i,s_j)$$
Potentials

Pixel Unary Potentials:
$$\varphi^P_i(s_i) = w^P_1(s_i)\,\xi^P_i(s_i) - w^P_2(s_i)\log p^P_i(s_i)$$
- ξ^P_i(s_i): admissible labels
- p^P_i(s_i): local appearance (trained on sparse labels, tested densely)
3D Point Unary Potentials:
$$\varphi^L_l(s_l) = w^L(s_l)\,\xi^L_l(s_l)$$
- ξ^L_l(s_l) = 0 ⇔ the 3D point lies within a 3D primitive of class s_l
- Sky is not admissible
Sparsely Labeled Pixel Unary Potentials:
$$\varphi^{P'}_i(s_i) = w^{P'}(s_i)\,\xi^{P'}_i(s_i)$$

[Figure: scribbles in a key frame and the corresponding non-key frame]
Geometric Unary Potentials:
$$\varphi^F_{mi}(s_i) = w^F(s_i)\,[\,p_i \in R_m \wedge \nu_m(p_i) \neq s_i\,]$$

[Figure: curbs and folds in the side view; a minimum bounding disc around a 2D fold separates wall, road and sidewalk at a label boundary]
Pixel Pairwise Potentials:
$$\psi^{P,P}_{ij}(s_i,s_j) = w^{P,P}_1(s_i,s_j)\exp\left(-\frac{\|p_i-p_j\|^2}{2\,\theta^{P,P}_1}\right) + w^{P,P}_2(s_i,s_j)\exp\left(-\frac{\|p_i-p_j\|^2}{2\,\theta^{P,P}_2} - \frac{\|c_i-c_j\|^2}{2\,\theta^{P,P}_3}\right)$$
- Fully connected model with appearance and smoothness kernels
- Gaussian potentials ensure tractable inference [Krähenbühl & Koltun, NIPS 2011]
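As an illustration, the two Gaussian kernels of the pixel pairwise term can be evaluated directly for a single pixel pair; a minimal numpy sketch in which the weights w and bandwidths θ are hypothetical placeholders:

```python
import numpy as np

def pixel_pairwise(p_i, p_j, c_i, c_j, w1=1.0, w2=1.0,
                   theta1=1.0, theta2=1.0, theta3=1.0):
    """Pixel pairwise potential: a smoothness kernel on positions p plus an
    appearance kernel on positions p and colors c (both Gaussian)."""
    d_p = np.sum((p_i - p_j) ** 2)                       # squared position distance
    d_c = np.sum((c_i - c_j) ** 2)                       # squared color distance
    smooth = w1 * np.exp(-d_p / (2.0 * theta1))          # position-only kernel
    appear = w2 * np.exp(-d_p / (2.0 * theta2) - d_c / (2.0 * theta3))
    return smooth + appear
```

Identical pixels receive the maximal affinity w1 + w2, and the affinity decays with spatial and color distance, which is what makes the fully connected model favor coherent regions.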
3D Pairwise Potentials:
$$\psi^{L,L}_{lk}(s_l,s_k) = w^{L,L}(s_l,s_k)\exp\left(-\frac{\|p^{3d}_l-p^{3d}_k\|^2}{2\,\theta^{L,L}_1} - \frac{(n_l-n_k)^2}{2\,\theta^{L,L}_2}\right)$$
- Fully connected model in 3D
- Encourages the same label for nearby 3D points with similar normals
2D/3D Pairwise Potentials:
$$\psi^{P,L}_{il}(s_i,s_l) = w^{P,L}(s_i,s_l)\exp\left(-\frac{\|p_i-\pi_l\|^2}{2\,\theta^{P,L}}\right)$$
- Fully connected 2D/3D field
- Encourages label consistency in a neighborhood of the projection π_l
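To illustrate the 2D/3D coupling: π_l is the image projection of 3D point l, and the potential ties each pixel to nearby projected points. With a pinhole camera model this can be sketched as follows (the intrinsics K, weight and bandwidth values are hypothetical placeholders, not from the talk):

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],   # hypothetical pinhole intrinsics
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(p3d):
    """Project a 3D point (camera coordinates) to pixel coordinates pi_l."""
    q = K @ p3d
    return q[:2] / q[2]

def cross_potential(p_i, p3d_l, w=1.0, theta=4.0):
    """2D/3D pairwise potential: couples pixel i to 3D point l via a
    Gaussian around the point's projection pi_l."""
    pi_l = project(p3d_l)
    return w * np.exp(-np.sum((p_i - pi_l) ** 2) / (2.0 * theta))
```

A pixel at the projection receives the full weight w; pixels far from any projected 3D point are effectively decoupled from the 3D labels.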
2D Scribbled Pairwise Potentials:
$$\psi^{P,P'}_{ij}(s_i,s_j) = w^{P,P'}(s_i,s_j)\exp\left(-\frac{\|p_i-p^L_j\|^2}{2\,\theta^{P,P'}}\right)$$
- Fully connected 2D/sparse 2D field
- Propagates the sparse annotation at p^L_j to its neighborhood
Learning and Inference

Inference:
- Factorized mean field: Q(s) = ∏_{i ∈ P∪L} Q_i(s_i)
- Efficient variational inference [Krähenbühl & Koltun, NIPS 2011]

Learning:
- Θ = {w^P_1, w^P_2, w^L, w^F, w^{P,P}_1, w^{P,P}_2, w^{P,L}, w^{L,L}_1, w^{L,L}_2}
- Empirical risk minimization (univariate logistic loss):
$$f(\Theta) = -\sum_{n=1}^{N} \sum_{i \in P} \log Q_{n,i}(s^*_{n,i}) + \lambda\, C(\Theta)$$
  - s^*_{n,i}: ground truth label
  - Q_{n,i}(·): approximate marginal
- Stochastic gradient descent
- Same loss for instance & semantic segmentation
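As a toy illustration of the factorized mean field idea (not the talk's actual Gaussian-kernel model), each update sets Q_i(s) ∝ exp(−unary − expected pairwise energy under the other marginals). The sketch below runs this on a small chain with Potts-style pairwise terms; all energies and parameter values are hypothetical:

```python
import numpy as np

def mean_field(unary, pair_w, iters=20):
    """Naive mean-field inference for a chain CRF with unary[i, s] costs and a
    Potts penalty pair_w between neighboring nodes. Returns marginals Q[i, s]."""
    n, k = unary.shape
    Q = np.full((n, k), 1.0 / k)                  # fully factorized init
    for _ in range(iters):
        for i in range(n):
            # Expected Potts energy: E_{s_j ~ Q_j}[w * 1(s_i != s_j)] = w * (1 - Q_j(s_i))
            msg = np.zeros(k)
            for j in (i - 1, i + 1):
                if 0 <= j < n:
                    msg += pair_w * (1.0 - Q[j])  # cost of disagreeing with neighbor j
            logits = -(unary[i] + msg)
            logits -= logits.max()                # numerical stability
            Q[i] = np.exp(logits)
            Q[i] /= Q[i].sum()                    # renormalize the marginal
    return Q
```

In the talk's model the same scheme applies, except that the pairwise expectations over the dense Gaussian kernels are computed efficiently via filtering, which is what makes fully connected inference tractable.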
Quantitative Results

Method           | JI   | Acc
LA               | 82.1 | 90.0
LA+PW            | 84.4 | 91.4
LA+PW+CO+3D      | 88.2 | 93.7
Full Model       | 89.0 | 94.1
Full Model (90%) | 94.9 | 97.4
Full Model (80%) | 96.6 | 98.2
Full Model (70%) | 97.5 | 98.7

- LA: Local Appearance
- PW: 2D Pairwise Potentials
- CO: 3D Primitive Constraints
- 3D: 3D Point Constraints
Qualitative Comparison to Baselines
- 2D Label Propagation [Vijayanarasimhan et al., 2012]
- Projection of 3D Primitives
- Proposed Method
Qualitative Results

Qualitative Results for Dynamic Scenes
Dataset Statistics

Dataset             | #frames | semantic | instance | consecutive | 3D annotation
CamVid              | 631×1   | ✓        |          | ✓           |
DUS                 | 500×2   | ✓        |          | ✓           |
CityScapes (fine)   | 5000×1  | ✓        | ✓        | ?           |
CityScapes (coarse) | 20000×1 | ✓        | ✓        |             |
Ours                | 55218×2 | ✓        | ✓        | ✓           | ✓
Video
Supervision Transfer

Idea
- Transfer labels from a problem for which they are easy to obtain
- Develop algorithms which facilitate this transfer

Examples
- Domain transfer (e.g., 3D to 2D)
- Resolution transfer (e.g., high resolution to low resolution)
- Content transfer (e.g., 3D models to real-world scenes)
Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data [Janai, Güney, Wulff, Black & Geiger, CVPR 2017]
Slow Flow
Sparsity Invariant CNNs [Uhrig, Schneider, Franke, Brox & Geiger, 3DV 2017]

Laser Scans are Sparse
- Goal: interpolation of a sparse / irregular depth map
Standard CNNs fail on Sparse Inputs

[Figure: input (5% sparsity), ground truth, standard ConvNet output, SparseConvNet output]
Sparse Convolutions

Regular convolution:
$$f_{u,v}(x) = \sum_{i,j=-k}^{k} x_{u+i,v+j}\, w_{i,j} + b$$
- x: input, w_{i,j}: filter weights, b: bias
Sparse convolution:
$$f_{u,v}(x,o) = \frac{\sum_{i,j=-k}^{k} o_{u+i,v+j}\, x_{u+i,v+j}\, w_{i,j}}{\sum_{i,j=-k}^{k} o_{u+i,v+j} + \epsilon} + b, \qquad f_{u,v}(o) = \max_{i,j=-k,\dots,k} o_{u+i,v+j}$$
- x: input, o: observability mask, w_{i,j}: filter weights, b: bias
Sparse Convolution Module

[Figure: the feature map is multiplied element-wise by the binary mask, convolved with the filter weights, and normalized by the convolved mask; the bias is added afterwards, and the mask itself is propagated by max pooling]

- Can be easily implemented using regular operations
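The "regular operations" claim can be made concrete: convolve the masked input, convolve the mask to obtain the normalizer, divide, add the bias, and max-pool the mask. A minimal single-channel numpy sketch under those assumptions (helper names and the ε value are ours):

```python
import numpy as np

def correlate2d_same(a, k):
    """Plain dense 2D correlation with zero padding (a 'regular' conv op)."""
    r = k.shape[0] // 2                       # half filter size (odd filter assumed)
    ap = np.pad(a, r)
    out = np.zeros(a.shape, dtype=float)
    for u in range(a.shape[0]):
        for v in range(a.shape[1]):
            out[u, v] = (ap[u:u + 2 * r + 1, v:v + 2 * r + 1] * k).sum()
    return out

def sparse_conv(x, o, w, b=0.0, eps=1e-5):
    """Sparse convolution assembled from regular ops:
    conv(o * x, w) / (conv(o, 1) + eps) + b, with the mask propagated by
    max pooling (here: any observed pixel in the window)."""
    num = correlate2d_same(x * o, w)               # convolve the masked features
    den = correlate2d_same(o, np.ones_like(w))     # count of observed pixels
    f = num / (den + eps) + b
    new_o = (den > 0).astype(float)                # max pooling over a binary mask
    return f, new_o
```

Because the output is normalized by the number of observed pixels in each window, a constant input yields (approximately) the same output regardless of the sparsity pattern, which is the invariance the slides demonstrate.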
Network Architecture

[Figure: five sparse convolution layers, each with 16 feature channels, with filter sizes 11, 7, 5, 3, 3 and mask max-pooling of matching size, followed by a single-channel output]

- Standard 2D denoising CNN architecture
- Yellow: depth image (input/output)
- Red: observation mask
- Green: feature maps
Results on Synthia
- Results on the Synthia dataset [Ros et al., CVPR 2016]
- Trained and evaluated at 5% sparsity
- Trained at 5% sparsity and evaluated at 20% sparsity
KITTI Depth Dataset
- 93k images with depth ground truth
- Depth prediction/completion benchmark available soon!
Results on Synthia-to-KITTI Depth Adaptation

MAE (m) by sparsity level at training time:

Sparsity at train: | 5%    | 10%   | 20%   | 30%   | 40%   | 50%   | 60%   | 70%
ConvNet            | 16.03 | 13.48 | 10.97 | 8.437 | 10.02 | 9.73  | 9.57  | 9.90
ConvNet+mask       | 16.18 | 16.44 | 16.54 | 16.16 | 15.64 | 15.27 | 14.62 | 14.11
SparseConvNet      | 0.722 | 0.723 | 0.732 | 0.734 | 0.733 | 0.731 | 0.731 | 0.730

- Trained on Synthia with different sparsity levels
- Evaluated on the KITTI Depth dataset
Supervision Transfer

Idea
- Transfer labels from a problem for which they are easy to obtain
- Develop algorithms which facilitate this transfer

Examples
- Domain transfer (e.g., 3D to 2D)
- Resolution transfer (e.g., high resolution to low resolution)
- Content transfer (e.g., 3D models to real-world scenes)
Augmented Reality Meets Deep Learning for Car Instance Segmentation in Urban Scenes [Alhaija, Mustikovela, Mescheder, Geiger & Rother, BMVC 2017]

Augmented Reality Meets Deep Learning
- Instance segmentation performance (MNC, Dai et al., CVPR 2016)
One more thing ...
A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos [Schöps, Schönberger, Galliani, Sattler, Schindler, Pollefeys & Geiger, CVPR 2017]
ETH3D Benchmark
www.eth3d.net
Thank you!