Deep Structured Learning
Chunhua Shen School of Computer Science, The University of Adelaide www.cs.adelaide.edu.au/~chhshen
Street scene understanding
Pedestrian detection
Our 2014 improved result, the best one before CVPR 2015
Pedestrian Detection with Spatially Pooled Features and Structured Ensemble Learning, IEEE T. PAMI 2015 http://arxiv.org/abs/1409.5209
Pedestrian Detection: average miss rate (Caltech data)
Car detection
KITTI car dataset (average precision, 70% overlap criterion). We are currently ranked 2nd.
Car and sign detection (examples: speed sign, caution sign)
Text in the wild
Big data (0.5M labelled examples) + deep learning
Method                        Precision  Recall  F-measure
Our Method                    0.84       0.70    0.76
Max et al. (ECCV 2014)        0.89       0.66    0.75
Huang et al. (ECCV 2014)      0.84       0.67    0.75
Neumann et al. (ICDAR 2011)   0.65       0.64    0.63
Oxford Visual Factory, acquired by Google in 2014
Text in the wild
Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/
Car license plate detection
Method              Precision  Recall
Our Method          0.960      0.952
Zhou (TIP 2012)     0.955      0.848
Lim (ICSUDET 2010)  0.837      0.905
Subset_AC:
Method          Precision  Recall
Ours            0.987      0.985
Hsu (TVT 2013)  0.91       0.96

Subset_LE:
Method          Precision  Recall
Ours            0.977      0.976
Hsu (TVT 2013)  0.91       0.95
Face recognition
Deep learning based system trained with 400k labelled faces. Labeled Faces in the Wild (LFW) face verification task: ~98.2% (single model). Best reported result at CVPR 2015 (deep learning + 200M labelled faces): Google FaceNet, ~99.7%.
Image captioning pipeline: image → extract image features, predict attributes/labels/locations (CNN) → language modeling (LSTM) → image caption.
Example caption: "A person is sitting at a dining table with a plate of …; a tablet and a bouquet of red flowers are on the side."
Fig: Qualitative results for images in the Microsoft COCO dataset. Selected attributes and corresponding detection scores are shown to the right of each image. Our generated caption (in black), the baseline result (in blue), and a human caption (in red) are shown below.
Our team currently achieves the best result on 3 evaluation metrics (BLEU-1, 2, 3)
VQA pipeline: image → extract image features, predict attributes/labels/locations (CNN) → language modeling and question understanding (LSTM) → knowledge base → generate answer.
Q: What kind of glasses are they drinking out of?
A: Wine
Fig: VQA examples from MS COCO, each showing the top-5 predicted attributes, the generated caption, and question answering results.
Table: Results on the open-answer task for various question types on MS COCO/VQA. All results are the percentage of answers in agreement with human subjects. ‡ indicates ground truth attribute labels are used.
Image Captioning with an Intermediate Attributes Layer
Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, Anthony Dick http://arxiv.org/abs/1506.01144
Feed-forward neural networks
– Supervised learning
– Fully-connected network
– Multi-layer perceptron (MLP)
Convolutional Neural Network
LeNet [LeCun98]
Other networks:
Recurrent Network (speech/text parsing)
Convolutional Neural Network
Fully connected net vs. convolutional neural net (1-dimensional convolution): for a 5*1 input and a 3*1 filter, the fully connected layer needs 15 weights (model parameters), while the conv layer needs only 3.
CNN layer property: the same filter weights are shared and applied regardless of the spatial position.
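A minimal NumPy sketch of this weight-count contrast (the filter values and input numbers are my own illustrative choices, not from the slides): a fully connected map from 5 inputs to 3 outputs needs a full 3×5 weight matrix, while a 3-tap convolution reuses the same 3 weights at every position.

```python
import numpy as np

# Fully connected layer mapping 5 inputs to 3 outputs: a full 3x5 weight matrix.
fc_weights = np.zeros((3, 5))
assert fc_weights.size == 15             # 15 model parameters

# 1-D conv layer: one 3-tap filter slides over the input,
# so only 3 weights are shared across all spatial positions.
filt = np.array([1.0, 0.0, -1.0])        # 3 model parameters
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # the 5*1 input

# Cross-correlation via np.convolve (which flips its kernel), 'valid' mode:
# each of the 3 output positions uses the same 3 weights.
y = np.convolve(x, filt[::-1], mode="valid")
```

For this edge-detector-like filter the output is constant, `[-2, -2, -2]`, because the input grows linearly.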
A network contains not only Conv/FC layers:
– Pooling layer
– Cross-channel normalisation layer
– Drop-out layer
– ...
Stochastic gradient descent (SGD)
– Mini-batch: a small number of examples for one gradient update
– Momentum (to damp gradient fluctuation)
Parameter update in the t-th iteration uses: momentum 0.9; learning rate 0.01, decreasing over time; weight decay 0.0005.
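The update rule itself did not survive extraction; as a hedged sketch, here is a standard SGD-with-momentum step using the quoted hyperparameters (the toy quadratic loss is my own stand-in, not the slides' CNN loss):

```python
# Standard SGD with momentum and weight decay, using the hyperparameters
# quoted above: momentum 0.9, learning rate 0.01, weight decay 0.0005.
MOMENTUM, LR, WEIGHT_DECAY = 0.9, 0.01, 0.0005

def sgd_momentum_step(w, v, grad):
    """One parameter update: the velocity v accumulates decayed gradients."""
    v = MOMENTUM * v - LR * (grad + WEIGHT_DECAY * w)
    return w + v, v

# Toy quadratic loss L(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)
# after 200 steps w has decayed close to the minimum at 0
```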
CNN is "simple":
– The whole network repeats a simple building block
CNN is difficult:
– Designing the architecture: why a 224*224 input? why 11*11 or 5*5 filters? why 9 layers?
– Hard to train (non-convex)
– Needs GPUs
– ...
Depth estimation from a single image:
– Alternatives: depth sensors (e.g., Kinect); machine learning methods
– An ill-posed problem: no reliable depth cues
– Useful for scene understanding and 3D modelling; benefits other vision tasks
11/09/2015
CRFs model the conditional probability density function
– Z(x) integrable here (y is continuous)
– Optimization: back propagation
– Need to perform convolutions for each superpixel image patch
– Deep convolutional neural fields with fully convolutional networks and superpixel pooling
– Only need to perform convolutions over the entire image once
– … can be used
Fig: depth estimation results.
Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields. Fayao Liu, Chunhua Shen, Guosheng Lin. CVPR 2015. http://arxiv.org/abs/1502.07411
– formulate CNN based potential functions in CRFs
– capture the contextual information to improve performance
Conditional likelihood and energy function: the energy is a sum of CNN based (log-)potential functions (factor functions). The potential function can be a unary, pairwise, or high-order potential function:
– CNN based unary potential: measures the labelling confidence of a single variable
– CNN based pairwise potential: measures the confidence of the pairwise label configuration
Factor graph: a factorization of the joint distribution of variables.
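The slide labels a conditional likelihood and an energy function whose equations were rendered as images; in standard CRF notation (my reconstruction, not the slides' own formulas, with U and V the CNN-based unary and pairwise potentials) they read:

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\bigl(-E(\mathbf{y}, \mathbf{x})\bigr),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\bigl(-E(\mathbf{y}', \mathbf{x})\bigr)

E(\mathbf{y}, \mathbf{x}) = \sum_{p} U(y_p, \mathbf{x}; \boldsymbol{\theta}_U)
  + \sum_{(p,q)} V(y_p, y_q, \mathbf{x}; \boldsymbol{\theta}_V)
```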
Prediction is made by marginal inference (e.g., message passing). CRF-CNN joint learning: learn the CNN potential functions by optimizing the CRF objective, typically minimizing the negative conditional log-likelihood (NLL), with the CNN parameters learned by stochastic gradient descent. The partition function Z makes this optimization difficult: each SGD iteration requires approximate marginal inference to calculate the factor marginals, and since CNN training needs a large number of SGD iterations, training becomes intractable.
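To make the difficulty with Z concrete: the NLL gradient (a standard CRF result, in my notation) contains an expectation under the model distribution, which decomposes into sums over factor marginals. That is why every SGD step needs (approximate) marginal inference:

```latex
-\log P(\mathbf{y} \mid \mathbf{x}) = E(\mathbf{y}, \mathbf{x}) + \log Z(\mathbf{x})

\nabla_{\boldsymbol{\theta}} \bigl[-\log P(\mathbf{y} \mid \mathbf{x})\bigr]
= \nabla_{\boldsymbol{\theta}} E(\mathbf{y}, \mathbf{x})
- \mathbb{E}_{\mathbf{y}' \sim P(\mathbf{y}' \mid \mathbf{x})}
  \bigl[\nabla_{\boldsymbol{\theta}} E(\mathbf{y}', \mathbf{x})\bigr]
```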
Common workarounds:
– Apply approximate learning objectives
– Directly target the final prediction
Our approach:
– Do not learn the potential functions; instead, learn CNN estimators to directly output the required intermediate values in an inference algorithm
– e.g., a variable conditioned only on the neighbours within a single factor
Multi-scale feature extraction: an image pyramid with 3 scale levels; each scale goes through Network Part-1 (6 conv blocks) to produce feature maps, then Network Part-2 (2 FC layers).
One unary/pairwise potential network: the three d-dimensional feature maps are upsampled and concatenated into a 3d-dimensional combined feature map. The resized image goes through 6 convolution blocks to output 3 feature maps; a CRF graph is then constructed, node or edge features are generated from the feature maps, and these features go through a network to generate the unary or pairwise potentials.
Network Part-1 (conv blocks):
Conv block 1: 3 x 3 conv 64; 3 x 3 conv 64; 2 x 2 pooling
Conv block 2: 3 x 3 conv 128; 3 x 3 conv 128; 2 x 2 pooling
Conv block 3: 3 x 3 conv 256; 3 x 3 conv 256; 3 x 3 conv 256; 2 x 2 pooling
Conv block 4: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
Conv block 5: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
Conv block 6: 7 x 7 conv 4096; 3 x 3 conv 512; 3 x 3 conv 512
Network Part-2 (fully-connected layers): Fc 512, then Fc 21 (unary) or Fc 441 (pairwise); i.e., two layers whose number of output units equals the number of classes K (unary) or K^2 (pairwise).
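To make the capacity of these conv blocks concrete, a rough parameter count (my own arithmetic, assuming a 3-channel RGB input and one bias per filter; padding and stride do not affect weight counts):

```python
# Weight counts for the conv blocks listed above.
# Assumes a 3-channel RGB input and one bias per output filter.
def conv_params(k, c_in, c_out):
    """A k x k conv layer: k*k*c_in weights per filter plus a bias, c_out filters."""
    return (k * k * c_in + 1) * c_out

blocks = [
    [(3, 3, 64), (3, 64, 64)],                        # Conv block 1
    [(3, 64, 128), (3, 128, 128)],                    # Conv block 2
    [(3, 128, 256), (3, 256, 256), (3, 256, 256)],    # Conv block 3
    [(3, 256, 512), (3, 512, 512), (3, 512, 512)],    # Conv block 4
    [(3, 512, 512), (3, 512, 512), (3, 512, 512)],    # Conv block 5
    [(7, 512, 4096), (3, 4096, 512), (3, 512, 512)],  # Conv block 6
]
total = sum(conv_params(*layer) for block in blocks for layer in block)
# roughly 139M weights, dominated by the 7 x 7 conv 4096 layer in block 6
```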
Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation, arXiv 2015
belief propagation: message passing based inference
A simple example of marginal inference on node y2 in the chain y1 – y2 – y3:
– A message is a K-dimensional vector, where K is the number of classes (node states)
– Variable-to-factor message: the product of the incoming factor-to-variable messages, excluding the target factor
– Factor-to-variable message: sums out the factor's other variables
– Marginal distribution (beliefs) of one variable: the product of all incoming factor-to-variable messages
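A runnable toy version of this example (the potentials and K = 2 are my own illustrative choices, not from the slides), computing the marginal of y2 on the chain y1 – y2 – y3 by sum-product message passing:

```python
import numpy as np

K = 2  # number of classes (node states); messages are K-dimensional vectors
unary = {1: np.array([0.6, 0.4]),
         2: np.array([0.5, 0.5]),
         3: np.array([0.3, 0.7])}
pairwise = np.array([[0.8, 0.2],
                     [0.2, 0.8]])  # shared pairwise factor psi(y_p, y_q)

# Variable-to-factor messages from the leaves y1 and y3: just their unaries.
m_y1, m_y3 = unary[1], unary[3]

# Factor-to-variable messages into y2: sum out the factor's other variable.
m_F12_to_y2 = pairwise.T @ m_y1   # sum_{y1} psi(y1, y2) * m(y1)
m_F23_to_y2 = pairwise.T @ m_y3   # sum_{y3} psi(y3, y2) * m(y3)

# Belief of y2 = unary * product of all incoming factor-to-variable messages.
belief = unary[2] * m_F12_to_y2 * m_F23_to_y2
marginal_y2 = belief / belief.sum()
```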
– Don't need to learn the potential functions
The factor-to-variable message: a message prediction function formulated by a CNN. Its dependent message feature vector encodes all dependent messages from the neighbouring nodes that are connected to node p by the factor F, together with the input image region.
Learning: define the cross entropy loss between the ideal marginals and the marginals estimated from the CNN-predicted messages; the optimization problem minimizes this loss over the CNN parameters.
Application to semantic segmentation (prediction for one image patch):
– Network C1 (multi-scale FCNNs) outputs a feature map of dimension d
– Create the CRF graph: one node per spatial position in the feature map; each node connects to the nodes that lie within a spatial range box around it (pairwise connections)
– Perform message passing inference with the message estimator, Network C2 (fully-connected layers), which predicts all factor-to-variable messages
– Calculate the marginals on all nodes to obtain the prediction
Networks C1 and C2 need to be learned.
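A small sketch of the range-box connection step described above (the feature-map size, radius, and row-major node indexing are my own illustrative choices):

```python
# Build pairwise CRF connections: one node per spatial position of an
# H x W feature map; each node connects to every node inside a
# (2R+1) x (2R+1) range box centred on it.
H, W, R = 4, 5, 1

def range_box_edges(h, w, r):
    """Return undirected edges (p, q), p < q, with q inside p's range box."""
    def node(i, j):
        return i * w + j
    edges = set()
    for i in range(h):
        for j in range(w):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w and node(i, j) < node(ni, nj):
                        edges.add((node(i, j), node(ni, nj)))
    return edges

edges = range_box_edges(H, W, R)
# radius 1 gives an 8-neighbourhood: 55 undirected edges on a 4 x 5 grid
```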
Leaderboard
Deeply Learning the Messages in Message Passing Inference Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel NIPS 2015. http://arxiv.org/abs/1506.02108
✦ I am a big fan of Deep Learning + Big Data
✦ Dense (per-pixel) prediction: computer vision research has been shifting from per-image prediction to per-pixel prediction, so structured learning plays an increasingly important role