
Two Parallel Deep Convolutional Neural Networks for Pedestrian Detection

Bo-Yao Lin

Chu-Song Chen

Institute of Information Science, Academia Sinica, Taipei, Taiwan
Email: [email protected]

Institute of Information Science & Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
Email: [email protected]

Abstract—Pedestrian detection has attracted considerable attention in the field of computer vision in recent years. When training a single deep convolutional neural network (CNN) model for pedestrian detection, it is difficult to handle both the data imbalance between positive and negative examples and the easily confused negative samples. In this paper, we present a deep learning approach that combines two parallel deep CNN models for pedestrian detection. We propose using two deep CNNs, each of which solves a particular mission-oriented task, to form parallel classification models. The models are then integrated to build a more robust pedestrian detector. Experimental results on the Caltech dataset demonstrate the effectiveness of our approach compared to other state-of-the-art deep CNN methods.

I. INTRODUCTION

Pedestrian detection from images is an active research area that has become increasingly popular within the last decade. A robust pedestrian detector is a key component in fields such as automotive safety [1], robotics [2], and visual surveillance [3]. Pedestrian detection is regarded as a canonical object detection task whose main goal is to differentiate human bodies from backgrounds. Detecting humans in various scenes is challenging because humans may take various postures under partial occlusion. Furthermore, the shapes of several human-like objects on the street (such as postboxes and traffic lights) may be confused with human bodies owing to the large ambiguity between these two types of objects, making the problem even more challenging.

Deep convolutional neural networks (CNNs) have received considerable attention in recent years and have shown their effectiveness for object recognition. Deep CNNs can learn discriminative feature representations from raw RGB pixels in an end-to-end manner, unifying feature extraction and classification in a single learning model. They have been widely used in image classification [4] [5], object detection and localization [6] [7], and segmentation [8]. Several deep architectures have been proposed for general object recognition tasks, e.g., AlexNet [5], GoogLeNet [9], and VGG [10]. GoogLeNet has achieved great success on both the image-classification and object-detection

tasks in the recent ILSVRC2014, which involves 1.2 million images of 1000 classes [4]. Owing to the promising results of general deep CNN models, a useful method for training a deep CNN is to adopt the parameters learned from the general image classification problem, pre-trained on ImageNet, as the initial weights, and then fine-tune the weights in a transfer-learning manner to solve problems in other related domains. Such a parameter-transfer scheme has achieved advantageous effects in several object detection tasks, such as R-CNN [6]. Hence, in this paper, we follow this transfer-learning principle and pre-train on ImageNet to learn a pedestrian detector. The study in [11] provides a detailed view of the architectures and parameters for implementing pedestrian detection with deep CNN models through a wide range of experiments.

However, based on our observations, direct parameter transfer or fine-tuning usually cannot solve the problem well due to the discrepancy between pedestrian detection and general classification. First, pedestrian detection is a two-class problem involving pedestrian and background, where the background class is a universal concept consisting of unlimited kinds of objects and scenes. It is hard to reflect the variations of the backgrounds with a general image classification model that learns from 1000 specific object classes. Furthermore, the imbalance caused by the background training data (which far outnumber the pedestrian training data) also makes the fine-tuned network ineffective.

To address this problem, we use GoogLeNet in a different organization in this paper. An advantageous characteristic of GoogLeNet is that it is a flexible deep learning architecture consisting of early layers (for learning early representations), middle layers (for deep feature extraction), and final layers (for integration and classification). In particular, the "inception" is a repetitive and reusable structure employed to construct the middle layers, and the number of inceptions can be altered to fit different task goals. In this work, we suggest using a GoogLeNet with fewer inceptions, which largely reduces the number of parameters that must be learned during the training stage, and we show that it is more effective for pedestrian detection.
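To make the parameter-transfer scheme described above concrete, the following is a minimal sketch in modern PyTorch (not the authors' original Caffe pipeline): ImageNet-pretrained GoogLeNet weights are loaded, the 1000-way softmax head is replaced by a single-logit binary classifier, and the network is fine-tuned. The hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

# Parameter transfer: start from ImageNet-pretrained GoogLeNet weights.
net = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
net.aux_logits = False  # keep only the main classifier head's output

# Replace the 1000-way classifier with a single logit for the
# binary pedestrian-vs-background decision.
net.fc = nn.Linear(net.fc.in_features, 1)

criterion = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy loss
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

def fine_tune_step(images, labels):
    """One fine-tuning step on a batch of 224x224 crops with 0/1 labels."""
    logits = net(images).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```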


Fig. 1. The structure of GoogLeNet: (a) early layers, a sequence of 7×7, 3×3, and 1×1 convolutions with 3×3 max pooling applied to the 224×224 input image; (b) middle layers, inception modules with parallel 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling merged by a DepthConcat stage; (c) final layers, 5×5 average pooling, a 1×1 convolution, FC 1024, FC 1, and a sigmoid cross-entropy loss.

Besides, based on the above observation, we found that a single deep CNN model learned from the imbalanced training data can hardly differentiate the critical negative examples from the positive examples of pedestrians. To overcome this difficulty, we introduce another deep CNN with a different mission assigned, combine these deep CNNs, and mix the final-layer classification results of the models to form a pedestrian detector. We call the approach Two Parallel Deep Convolutional Neural Networks (TPDCNN). Finally, we validate its performance on the Caltech dataset [12], a representative publicly available benchmark for pedestrian detection.

II. RELATED WORK

In this section, we review several methods designed for pedestrian detection. Existing methods can be divided into two categories, handcrafted-feature and deep-learning approaches, and we review them from these two directions.

Handcrafted features, e.g., Haar-like features [13], SIFT [14], HOG [15], and HOG-LBP [16], have been widely employed for pedestrian detection. To handle more complex articulations of human parts, several deformation models [17] [18] [19] have been introduced. These deformable part models (DPMs) learn a mixture of local templates for the body parts. Classifiers such as SVMs [15], boosting classifiers [20], and random forests [21] are then used to determine whether a pedestrian is detected. Integral channel features [20] have become a popular way to extract efficient features for pedestrian detection. Dollár et al. propose Aggregated Channel Features (ACF) [22], which comprise gradient histograms, gradients, and LUV color channels, and then learn the classifiers in a boosting manner. Ke et al. [23] use a convolutional network architecture with orthogonal PCA filters to improve ACF. Zhang et al. propose Checkerboards [24], which uses filtered channel features with HOG+LUV as low-level features. They find that the

number of filters appears to be the most important variable, and that checkerboard-like patterns or purely random filters can achieve good performance. They also add optical-flow features to further enhance performance.

Deep-learning methods perform end-to-end learning to tackle the pedestrian detection problem. In ConvNet [25], the authors use a CNN to handle the limited training data of the INRIA dataset; they also test performance on the Caltech pedestrian dataset while training on INRIA. DBN-Isol [26] uses a stack of Restricted Boltzmann Machines (RBMs) to extend deformable part models (DPMs); it designs overlapping parts at multiple layers and then verifies the visibility of a part multiple times at distinct layers. DBN-Mut [27] extends DBN-Isol to account for person-to-person relations. JointDeep [28] uses deep networks to jointly learn feature extraction, deformation handling, occlusion handling, and classification. Zeng et al. [29] adopt a deep model that jointly trains multi-stage classifiers through several stages of back-propagation and uses contextual features computed at different scales. SDN [30] uses "switchable layers" to jointly learn low-level features and high-level parts. Hosang et al. [11] employ CifarNet and R-CNN with AlexNet for pedestrian detection, and also discuss the performance impact of different architectures and parameters. Tian et al. introduce TA-CNN [31], where additional pedestrian attributes (e.g., "carrying backpack") and scene attributes (e.g., "vehicle" and "tree") are introduced to resolve confusions between positive and hard negative samples. LFOV [32] proposes a Large-Field-of-View deep network to make classification decisions simultaneously and accurately at multiple locations. Angelova et al. [33] propose cascading deep nets and fast features to speed up the detection process while maintaining good detection performance.

Fig. 2. The overall architecture of Two Parallel Deep Convolutional Neural Networks (TPDCNN): ACF proposals extracted from the 640×480 input image are resized to 224×224 and fed to two parallel rows, each composed of early layers, middle layers, and final layers (the upper row with two inceptions, the lower row with three); the outputs of the two rows are combined by average fusion.

III. METHOD

In this section, we illustrate the concept and details of the proposed TPDCNN. Before introducing our method, a brief review of GoogLeNet is given. First, an image fed into the network goes through the early layers shown in Figure 1(a), where early representations are formed through a sequence of 7 × 7, 1 × 1, and 3 × 3 convolutions and max-pooling operators. Then the middle layers, formed by a repetitive structure called the inception, follow the early layers. An inception has several parallel 1 × 1, 3 × 3, and 5 × 5 convolutions and a max pooling, summarized by a depth-concatenation stage, as shown in Figure 1(b); the depth concatenation is performed and its output is sent to the next inception. The inception serves as a basic unit module for extracting deep-representation features, and 9 inceptions are used in GoogLeNet. The final layers (shown in Figure 1(c)) integrate the extracted deep features through average pooling and a fully-connected layer, and linear classifiers are then constructed for the classification.

The scale of the learned parameters, i.e., the weights, in a neural network is a pivotal factor: performance may degrade if the number of parameters is too large for the limited training data. Compared to the AlexNet model with almost 60 million parameters, GoogLeNet reduces the learned parameters to approximately 5 million with an even deeper network. The original GoogLeNet uses a softmax to classify the 1000 categories of ImageNet. Pedestrian detection needs only two classes, pedestrian and background, which makes it a binary classification problem; in our work, we therefore replace the softmax layer with a sigmoid cross-entropy loss in the final linear-classifier layer.
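For concreteness, the following is a minimal sketch of the inception module just described, written in modern PyTorch rather than the paper's Caffe; the reduction widths passed as arguments and the placement of ReLUs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    """Minimal inception module as in Fig. 1(b): four parallel branches
    whose outputs are depth-concatenated along the channel axis."""

    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, kernel_size=1)      # 1x1 branch
        self.b3 = nn.Sequential(                           # 1x1 reduce -> 3x3
            nn.Conv2d(c_in, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1))
        self.b5 = nn.Sequential(                           # 1x1 reduce -> 5x5
            nn.Conv2d(c_in, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2))
        self.bp = nn.Sequential(                           # 3x3 max pool -> 1x1
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(c_in, cp, 1))

    def forward(self, x):
        # DepthConcat: stack the branch outputs along the channel axis.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```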

Nevertheless, using the whole GoogLeNet for pedestrian detection often suffers from a severe over-fitting problem, even when the network is fine-tuned from a diverse and large-scale dataset, ImageNet. The result of using GoogLeNet to train a pedestrian detector is shown in Section IV. In the following, we introduce a concise model to tackle this problem in Section III-A; the joint models are then described in Section III-B.

A. Concise model for pedestrian detection

The original GoogLeNet is a powerful model that can differentiate various objects. However, the features learned in the final layers of GoogLeNet are too restrictive to be fine-tuned into a pedestrian detector, whereas the features learned in earlier layers (based on ImageNet) generalize better because they are not as specific to particular object classes. This inspired us to use a concise model for the pedestrian detection problem. In our architecture, the model is reduced to a concise network with fewer inceptions in the middle layers (two inceptions are used here), while the early layers and final layers are retained. This makes the training process easier and faster because the overall number of parameters is reduced. Besides, we still make use of pre-training and initialize the weights of the early and middle layers with those learned from ImageNet. The concise model is illustrated in the upper row of Fig. 2. Compared to GoogLeNet, which achieves a log-average miss rate (MR) of 36.66%, our concise model considerably reduces the MR to 27.31% on the Caltech test set.
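A sketch of how such a concise network could be assembled, reusing the Inception module sketched above; the channel widths follow GoogLeNet's first two inceptions but are assumptions here, and the global average pooling is a simplification of the 5×5 average pooling in Fig. 1(c).

```python
import torch.nn as nn

def concise_model():
    """Concise network sketch: GoogLeNet-style early layers, only two
    inception modules in the middle, and final layers ending in a single
    logit trained with sigmoid cross-entropy. Widths are illustrative."""
    early = nn.Sequential(                      # cf. Fig. 1(a)
        nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
        nn.Conv2d(64, 64, 1), nn.ReLU(inplace=True),
        nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1))
    middle = nn.Sequential(                     # only 2 inceptions (upper row)
        Inception(192, 64, 96, 128, 16, 32, 32),    # -> 256 channels
        Inception(256, 128, 128, 192, 32, 96, 64))  # -> 480 channels
    final = nn.Sequential(                      # cf. Fig. 1(c)
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(480, 1024), nn.ReLU(inplace=True),
        nn.Linear(1024, 1))                     # single logit (FC 1)
    return nn.Sequential(early, middle, final)
```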

B. Two Parallel Deep Convolutional Neural Networks

Inter-class correlation, where some background regions are similar to pedestrians, is a main issue in pedestrian detection. Region proposals consisting of a large portion of background and a small portion of pedestrian are also critical negative samples that contribute to the inter-class correlation problem, and they may incur localization error. The concise model introduced in Section III-A still suffers from critical negative background samples that have large ambiguity and are easily confused with foreground pedestrians. To deal with this difficulty, we adopt a mission-oriented strategy for training the networks: networks trained with different goals are combined to form the TPDCNN architecture. Currently, TPDCNN has two rows (each row a separate model), as shown in Fig. 2; without loss of generality, more rows can be added for different tasks if necessary. The mission of the first deep model is to discriminate the pedestrian samples from general, easy negative background samples, while the mission of the other model is to discriminate them from critical negative examples only. We illustrate their details as follows.

Upper row of TPDCNN: The upper row of TPDCNN consists of the early layers, middle layers with 2 inceptions, and final layers, as described in Section III-A. To train this model, we use all of the available positive (pedestrian) and negative (background) samples. The ground-truth windows provided by the dataset are employed as positive samples. It is shown in [11] that adding jittered samples, i.e., positive proposals with large overlap with the ground truth, may degrade detection performance; instead, we employ horizontal mirroring to augment the positive samples. To collect the negative samples, we follow the convention of relevant studies such as [31]: an object-proposal method first produces multiple candidate windows, and the windows whose overlap with the ground truths is less than a threshold (here, 0.3) are chosen as negative samples. Both positive and negative samples are normalized to the size of 224 × 224 and fed into the deep CNNs. The object-proposal method and the input-normalization procedure used in this work are introduced in Section III-C.

Lower row of TPDCNN: We bestow a different mission on the lower row of TPDCNN, which focuses on a more difficult goal: harder negative samples should be separable from the positive ones in the learned feature space (even if the easier negative samples are sacrificed). As the assigned mission is more difficult, we add one more inception and adopt a more complex network with three inceptions in the middle layers, as shown in the lower part of Fig. 2. Due to the relatively scarce amount of positive samples, the positive samples used for training this row remain the same as in the upper row, but the negative samples are chosen as the ones that the upper row fails to classify correctly, as sketched below. Hence, the mission of the lower row is disparate from that of the first row, and the two rows are trained independently.
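The mission-oriented negative mining and the averaging fusion can be sketched as follows; the score threshold for deeming a negative "hard" and the function names are illustrative assumptions (in practice both steps would be batched).

```python
import torch

def mine_hard_negatives(upper_net, negatives, threshold=0.5):
    """Keep only the negatives that the upper row fails on, i.e. background
    windows it scores above the threshold; these become the lower row's
    training negatives. The threshold value is an assumption."""
    upper_net.eval()
    with torch.no_grad():
        scores = torch.sigmoid(upper_net(negatives).squeeze(1))
    return negatives[scores > threshold]

def tpdcnn_score(upper_net, lower_net, windows):
    """Average fusion of the two parallel rows' sigmoid outputs."""
    with torch.no_grad():
        s_up = torch.sigmoid(upper_net(windows).squeeze(1))
        s_low = torch.sigmoid(lower_net(windows).squeeze(1))
    return 0.5 * (s_up + s_low)
```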

There are several advantages of the TPDCNN model, which we discuss as follows. For building a pedestrian detector, hard negative human-like background regions are always a main issue. Increasing the number of hard negative samples in the training stage by changing the overlap threshold may improve performance, but it also causes a large imbalance between the positive and negative samples. The proposed TPDCNN handles the data-imbalance problem more properly and thus achieves better performance. Furthermore, if the two models were dependent or coherent, combining them would be less meaningful. Because only the negative data that are ambiguous to the upper row are used in training the lower network, the lower network has a stronger ability to differentiate such ambiguities (although it performs worse at distinguishing general negative from positive samples). These two rows are thus incoherent and mutually beneficial. Finally, the final score is generated by averaging the final-layer outputs of the two parallel networks.

C. Preprocessing and postprocessing

We describe the implementation details of the pre- and post-processing steps that complete the pedestrian detection task.

1) Preprocessing: To detect all of the pedestrians in an image, a possible way is to slide a window over numerous potential regions within the image. However, this is exhaustive and slow because of the large amount of convolution computations in deep CNN models. Many recent studies generate object proposals (candidate regions) to avoid the exhaustive search. SelectiveSearch [34] is the most popular proposal method and is widely used in object detection; using class-specific proposals reduces the number of proposals by a large margin. In this work, we follow the approach of [31] and use the ACF detector [22] to generate the object proposals (via a loose threshold) for pedestrian detection. The model window we choose for the pedestrian detector is 128 × 64 pixels, in which pedestrians have a height of 100 and a width of 41. The height of each region proposal extracted by ACF is normalized to 224 with the aspect ratio preserved. As the input to the network is 224 × 224, the empty parts arising from the aspect-ratio-preserving resizing are filled with the mean color used in GoogLeNet training.

2) Postprocessing: According to observations in the object-detection literature [6], deep neural networks are not reliable enough at localizing objects. To increase the precision of the detected regions, we adopt the bounding-box (BBox) regression method suggested in R-CNN [6] and train a ridge-regression model that predicts a new detection bounding box based on the features extracted from the convolution-layer outputs. To avoid multiple responses in an adjacent local area, we run a greedy non-maximum suppression (NMS) procedure [35] to suppress detections with lower scores that overlap largely with higher-scoring ones. Finally, we concatenate the preprocessing (object proposal), TPDCNN, and postprocessing (BBox regression + NMS) steps to form the entire approach. To implement TPDCNN, we modify the GoogLeNet implementation [36] on the Caffe [37] platform.
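As a hedged illustration of the pre- and post-processing just described, the sketch below normalizes an ACF proposal by aspect-ratio-preserving resizing with mean-color padding and applies greedy NMS; the exact mean color and the IoU threshold are assumptions, as the paper does not state them.

```python
import numpy as np
import cv2  # OpenCV, assumed here for image resizing

CAFFE_MEAN_BGR = (104, 117, 123)  # commonly used Caffe mean color (assumption)

def normalize_proposal(crop, size=224):
    """Resize a proposal so its height is `size`, keep the aspect ratio,
    and pad the remaining width with the training mean color."""
    h, w = crop.shape[:2]
    new_w = min(size, max(1, int(round(w * size / h))))
    resized = cv2.resize(crop, (new_w, size))
    canvas = np.full((size, size, 3), CAFFE_MEAN_BGR, dtype=resized.dtype)
    x0 = (size - new_w) // 2
    canvas[:, x0:x0 + new_w] = resized
    return canvas

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    by more than iou_thresh, and repeat on the survivors."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```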

TABLE I
THE SELECTION OF CONCISE MODELS FROM GOOGLENET

Networks     # of inceptions    Test MR
Upper row    2                  27.31%
Upper row    3                  29.52%
Lower row    2                  29.84%
Lower row    3                  25.38%

TABLE II
DETECTION QUALITY ON DIFFERENT DEEP ARCHITECTURES, WHERE 'Y' MEANS EMPLOYING BOUNDING-BOX REGRESSION AND NMS AFTER THE CNN STAGE, AND 'N' MEANS NOT EMPLOYING THEM

Networks         BBox Regression    Test MR
ACF [22]         N                  29.76%
GoogLeNet [9]    N                  36.66%
Upper row        N                  27.31%
Upper row        Y                  24.55%
Lower row        N                  25.38%
Lower row        Y                  24.73%
TPDCNN           N                  21.25%
TPDCNN           Y                  19.57%

TABLE III
COMPARISON TO OTHER DEEP LEARNING METHODS ON THE CALTECH DATASET (REASONABLE)

Networks                 Test MR
ConvNet [25]             77.20%
DBN-Isol [26]            53.14%
DBN-Mut [27]             48.22%
MultiSDP [29]            45.39%
JointDeep [28]           39.32%
SDN [30]                 37.87%
LFOV [32]                35.85%
DeepCascade [33]         31.11%
DeepCascade+ [33]        26.21%
R-CNN (AlexNet) [11]     23.30%
TA-CNN [31]              20.86%
TPDCNN                   19.57%


IV. EXPERIMENTAL RESULTS


A. Dataset


The Caltech dataset [12] is a well-designed pedestrian detection benchmark, and its associated toolkit provides an elaborate means of evaluating pedestrian detectors. The dataset consists of several video clips obtained from a car traversing U.S. streets under good weather conditions, and it contains many suburban and city scenes. The "Reasonable" setting of the training set contains 4250 frames with roughly 2 × 10^3 annotated pedestrians in the sampled frames. Here, we follow the same settings as [11] and sample one out of every three frames from the training video clips, yielding a total of 2 × 10^4 positive training samples extracted from 42782 frames and about 1.25 × 10^5 negative samples mined by the ACF proposals.

Fig. 3. Overall performance on the Caltech test set (reasonable) compared with deep models: miss rate versus false positives per image on log-log axes; the legend lists the log-average miss rate of each compared method.

B. Evaluations of TPDCNN

In this section, we present the performance of the proposed TPDCNN and compare it to other deep learning methods for pedestrian detection. We start with the selection of the concise models from GoogLeNet that form the two parallel deep CNNs. Table I shows the performance with different numbers of inceptions for the upper and lower rows individually. We can see that the MRs of the upper row are close when the number of inceptions in the middle layers is 2 or 3. However, as the lower row is assigned to deal with harder negative samples, more inceptions (here, 3) are advantageous for extracting more discriminative features.

We then show the detection quality of different deep architectures and the proposed TPDCNN in Table II. The original GoogLeNet structure is not promising for pedestrian detection compared to the other structures (even when fine-tuning is performed) and is even worse than the ACF proposals, which shows that the original, deeper GoogLeNet is not a proper architecture for pedestrian detection. Our upper-row model achieves a more favorable performance than

GoogLeNet. The lower-row model (trained on the positive and critical negative samples only) achieves performance comparable to the upper-row model. Both models perform better when bounding-box regression is used. The joint TPDCNN scheme boosts the performance compared with the above deep models, and introducing bounding-box regression on TPDCNN achieves a still more favorable result.

Finally, we compare our approach, TPDCNN, with other pedestrian detection approaches using deep learning: ConvNet [25], DBN-Isol [26], DBN-Mut [27], MultiSDP [29], JointDeep [28], SDN [30], LFOV [32], DeepCascade and DeepCascade+ [33], R-CNN (AlexNet) [11], and TA-CNN [31]. Table III shows the results. Our TPDCNN achieves 19.57% with bounding-box regression, which is more favorable than the other competitive deep learning methods for pedestrian detection [32], [33], [11], [31]. Figure 3 shows the overall performance of all deep models on the Caltech test set (reasonable).

V. CONCLUSION AND FUTURE WORK

In this paper, we presented the TPDCNN approach as a robust pedestrian detector. To address the problems arising from data imbalance and critical negative examples when training a pedestrian classifier, TPDCNN combines two deep CNN models that are assigned distinct missions during the training phase. We first introduced a concise structure derived from GoogLeNet with fewer inceptions and built our proposed models on this structure. The proposed TPDCNN model is then combined with ACF object proposals and bounding-box regression to enhance the efficiency and reliability of pedestrian detection. The performance is boosted when the two deep CNNs are integrated to form TPDCNN. Our approach achieves more favorable performance on the Caltech dataset, a widely adopted public benchmark, than the other deep-learning pedestrian-detection approaches compared. In the future, we plan to strengthen this method by combining the two parallel networks as a whole rather than simply averaging the outputs of the individual networks. Introducing more than two parallel networks may also improve the method. Considering dynamic features (such as optical flow) and additional pedestrian- or scene-related attributes in a multi-task learning setting is also potentially useful for further performance improvement.

ACKNOWLEDGMENT

This paper was supported in part by the project MOST 104-2221-E-001-023-MY2.

REFERENCES

[1] E. Coelingh, A. Eidehall, and M. Bengtsson, "Collision warning with full auto brake and pedestrian detection - a practical example of automatic emergency braking," in ITSC, 2010.
[2] A. Ess, B. Leibe, K. Schindler, and L. van Gool, "A mobile vision system for robust multi-person tracking," in CVPR, 2008.
[3] S. Chen, K. Lin, C. Chen, and Y. Hung, "Location-aware object detection via coherent region grouping," in ICASSP, 2015.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," IJCV, 2015.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[7] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in CVPR, 2014.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," PAMI, 2013.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[11] J. Hosang, M. Omran, R. Benenson, and B. Schiele, "Taking a deeper look at pedestrians," in CVPR, 2015.
[12] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," PAMI, 2012.
[13] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," in ICCV, 2003.
[14] D. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.
[15] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.

[16] X. Wang, T. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in ICCV, 2009.
[17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, 2010.
[18] Z. Lin and L. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," PAMI, 2010.
[19] L. Zhu, Y. Chen, and A. Yuille, "Learning a hierarchical deformable template for rapid deformable object parsing," PAMI, 2010.
[20] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in BMVC, 2009.
[21] P. Dollár, R. Appel, and W. Kienzle, "Crosstalk cascades for frame-rate pedestrian detection," in ECCV, 2012.
[22] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," PAMI, 2014.
[23] W. Ke, Y. Zhang, P. Wei, Q. Ye, and J. Jiao, "Pedestrian detection via PCA filters based convolutional channel features," in ICASSP, 2015.
[24] S. Zhang, R. Benenson, and B. Schiele, "Filtered channel features for pedestrian detection," in CVPR, 2015.
[25] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in CVPR, 2013.
[26] W. Ouyang and X. Wang, "A discriminative deep model for pedestrian detection with occlusion handling," in CVPR, 2012.
[27] W. Ouyang, X. Zeng, and X. Wang, "Modeling mutual visibility relationship with a deep model in pedestrian detection," in CVPR, 2013.
[28] W. Ouyang and X. Wang, "Joint deep learning for pedestrian detection," in ICCV, 2013.
[29] X. Zeng, W. Ouyang, and X. Wang, "Multi-stage contextual deep learning for pedestrian detection," in ICCV, 2013.
[30] P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable deep network for pedestrian detection," in CVPR, 2014.
[31] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in CVPR, 2015.
[32] A. Angelova, A. Krizhevsky, and V. Vanhoucke, "Pedestrian detection with a large-field-of-view deep network," in ICRA, 2015.
[33] A. Angelova, A. Krizhevsky, V. Vanhoucke, A. Ogale, and D. Ferguson, "Real-time pedestrian detection with deep network cascades," in BMVC, 2015.
[34] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, 2013.
[35] P. Dollár, "Piotr's Computer Vision Matlab Toolbox (PMT)," http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[36] Princeton Vision Group, "A GPU implementation of GoogLeNet," http://vision.princeton.edu/pvt/GoogLeNet.
[37] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014.
