Deep Structured Learning
Chunhua Shen, School of Computer Science, The University of Adelaide
www.cs.adelaide.edu.au/~chhshen
PowerPoint PPT Presentation


SLIDE 1

Deep Structured Learning

Chunhua Shen School of Computer Science, The University of Adelaide www.cs.adelaide.edu.au/~chhshen


SLIDE 2

Street scene understanding

SLIDE 3

Pedestrian detection


Our 2014 improved result, best one before CVPR2015

  • S. Paisitkriangkrai, C. Shen, A. van den Hengel. Pedestrian Detection with Spatially Pooled Features and Structured Ensemble Learning, IEEE T. PAMI 2015. http://arxiv.org/abs/1409.5209

SLIDE 4

Pedestrian Detection: average miss rate (Caltech data)

SLIDE 5

Car detection


Kitti car dataset (average precision) with 70% overlap. We are ranked 2nd currently.

SLIDE 6

SLIDE 7

Speed sign

SLIDE 8

Caution sign

SLIDE 9

Car and sign detection

SLIDE 10

Car and sign detection

SLIDE 11

Car and sign detection

SLIDE 12

Text in the wild


BigData (0.5M Labelled data) + Deep Learning

Method                      Precision  Recall  F-measure
Our Method                  0.84       0.70    0.76
Max et al. (ECCV2014)       0.89       0.66    0.75
Huang et al. (ECCV2014)     0.84       0.67    0.75
Neumann et al. (ICDAR2011)  0.65       0.64    0.63

  • ICDAR2003 text-in-the-wild detection results

Oxford Visual Factory, acquired by Google 2014

SLIDE 13

Text in the wild


Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

SLIDE 14

Text in the wild


Left: green rectangles are our results; red rectangles are ground truth. Right: results of http://textspotter.org/

SLIDE 15

Car license plate detection

Method             Precision  Recall
Our Method         0.960      0.952
Zhou (TIP2012)     0.955      0.848
Lim (ICSUDET2010)  0.837      0.905

  • Caltech Cars-markus 1999 dataset (http://www.vision.caltech.edu/html-files/archive.html)

Subset_AC                             Subset_LE
Method         Precision  Recall     Method         Precision  Recall
Ours           0.987      0.985      Ours           0.977      0.976
Hsu (TVT2013)  0.91       0.96       Hsu (TVT2013)  0.91       0.95

  • AOLP dataset (http://aolpr.ntust.edu.tw/lab/)
SLIDE 16

Car license plate detection


SLIDE 17

Face recognition

Deep-learning-based system trained with 400k labelled images. Labeled Faces in the Wild (LFW) face verification task: ~98.2% (single model). Best reported result at CVPR2015 (deep learning + 200M labelled images): Google FaceNet, ~99.7%.

SLIDE 18

Large scale Image Classification

Top-5 error rate: ~8% on ImageNet ILSVRC2014 (parallel training on workstations with 8 K40 GPUs). Best reported result (Google): ~4.8%.

SLIDE 19

Image Captioning

Pipeline: Image → Extract Image Features (CNN) → Attributes/Labels/Locations prediction → Language Modeling (LSTM) → Image Captions

Example caption: "A person is sitting at a dining table with a plate of fruit. A tablet and bouquet of red flowers are on the side."

SLIDE 20

Fig: Qualitative results for images in Microsoft COCO dataset. Selected attributes and corresponding detection scores are shown at the right side of each image. Our generated caption (in black), the Baseline results (in blue) and a human caption (in red) are shown below.

SLIDE 21

Our team currently achieves the best result on 3 out of 7 evaluation metrics (BLEU-1, 2, 3). We also achieve a top-5 ranking on the other evaluation metrics.
SLIDE 22

Visual Question Answering

Q: What kind of glasses are they drinking out of ?

Pipeline: Image → Extract Image Features (CNN) → Attributes/Labels/Locations prediction → Language Modeling (LSTM) + Question Understanding + Knowledge Base → Generate Answer

A: Wine

SLIDE 23

Visual Question Answering

Top-5 Attributes

  • couch, sleeping, brown, laying, dog

Question Answering

  • Does this animal appear to be resting?
  • yes (yes)
  • What is the color of the cushions?
  • brown (brown)

Generated caption

  • a dog laying on top of a couch next to a pillow.

Top-5 Attributes

  • group, people, children, cake, grass

Question Answering

  • What kind of event does this look like?
  • birthday party (birthday)
  • Is this woman trying to give the kids an uncooked cake?
  • no (no)

Generated caption

  • a group of children standing around a cake.
SLIDE 24

Visual Question Answering

Top-5 Attributes

  • top, cake, fruits, table, plates

Question Answering

  • What are the orange sticks?
  • carrots (carrots)
  • How many carrots are in the bowls?
  • 2 (over 10)
  • Is this set up for a party?
  • yes (yes)

Generated caption

  • a table topped with plates of food.

Top-5 Attributes

  • shelf, small, book, room, television

Question Answering

  • What pattern is on the curtain?
  • floral (leaves)
  • What sport is being displayed on the television?
  • football (football)
  • Is there a bookcase nearby?
  • yes (yes)

Generated caption

  • a living room with a couch and a television.

SLIDE 25

Table. Results on the open-answer task for various question types on MS COCO/VQA. All results are the percentage of answers in agreement with human subjects. ‡ indicates ground-truth attribute labels are used. Human results are also reported for reference.

Image Captioning with an Intermediate Attributes Layer

Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, Anthony Dick http://arxiv.org/abs/1506.01144

SLIDE 26

Deep Structured Learning

SLIDE 27

Background: Neural Networks

Feed-forward neural networks

Supervised learning; fully-connected network

Multi-layer perceptron (MLP)

Convolutional Neural Network

LeNet [LeCun98]

Other networks:

Recurrent Network (speech/text parsing)

SLIDE 28

Convolutional Neural Network

Fully connected net vs. convolutional neural net (1-dimensional convolution):

  • Fully connected: number of weights = 15 (model parameters)
  • Convolutional: number of weights = 3

CNN layer properties:

  • 1. sparse connectivity (spatially-local connections)
  • 2. shared weights (same colour: shared weights, regardless of spatial position)

Another aspect: input 5*1; the conv layer applies a filter 3*1; output 3*1.
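The 1-D convolution described above can be sketched in a few lines of NumPy; the input and filter values below are illustrative, not from the slides:

```python
import numpy as np

def conv1d_valid(x, w):
    """1-D 'valid' convolution (cross-correlation form).

    The same 3 filter weights are reused at every position
    (weight sharing), and each output sees only 3 neighbouring
    inputs (sparse, spatially-local connectivity).
    """
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input: 5*1
w = np.array([1.0, 0.0, -1.0])           # filter: 3*1 (only 3 weights)
y = conv1d_valid(x, w)                   # output: 3*1
```

A fully connected layer mapping the same 5 inputs to 3 outputs would need 5*3 = 15 weights; the conv layer reuses the same 3.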

SLIDE 29

Network layers

A network contains not only Conv/FC layers, but also:

  • Pooling layer
  • Cross-channel normalisation layer
  • Drop-out layer
  • ...

SLIDE 30

CNN learning

Stochastic gradient descent (SGD)

Mini-batch: a small number of examples per gradient update. Momentum (avoids gradient fluctuation).

Parameter update in the t-th iteration: momentum 0.9; learning rate 0.01, decreasing; weight decay 0.0005.
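A minimal sketch of one such update, using the slide's hyper-parameters (momentum 0.9, learning rate 0.01, weight decay 0.0005); the exact form of the update rule is an assumption, following the common velocity formulation:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum update; v is the velocity buffer.

    The weight-decay term acts as L2 regularisation, pulling w
    towards zero; momentum smooths successive gradient directions.
    """
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])    # current parameters
v = np.zeros_like(w)         # initial velocity
grad = np.array([0.5, 0.5])  # gradient from one mini-batch
w, v = sgd_momentum_step(w, v, grad)
```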

SLIDE 31

Summary

CNN is “simple”

...

The whole network: repeating simple building blocks

SLIDE 32

Summary

CNN is difficult

  • designing the architecture
  • hard to train (non-convex)
  • needs GPUs
  • ...

Why 224? Why 11*11? Why 9 layers? Why 5*5?

SLIDE 33

Depth estimation from single monocular images

  • Depth acquisition:

– Depth sensors, e.g., Kinect – Machine learning methods

  • Most vision datasets are still RGB images
  • Estimate depth from single RGB images

– Ill-posed problem

SLIDE 34

Depth Estimation From Single Monocular Images

SLIDE 35

Depth Estimation From Single Monocular Images

  • Useful

– Scene understanding – 3D modelling – Benefit other vision tasks

  • e.g., semantic labellings, pose estimations
  • Challenging

– No reliable depth cues

  • e.g., stereo correspondence, motion information
SLIDE 36

Our method

  • Joint learning: Continuous CRF + deep CNN
  • Exact maximization of log-likelihood
  • Closed form solution for MAP inference
SLIDE 37

11/09/2015

Deep Convolutional Neural Fields

SLIDE 38

Continuous CRF

  • Given image x with labels y, the CRF models the conditional probability density function
  • Z(x) is the partition function

– Z(x) is integrable here (y is continuous)
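The equations referenced on this slide do not survive in this transcript; as a sketch consistent with the DCNF paper cited later (Liu, Shen, Lin, CVPR2015), the model is:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,
  \exp\!\big(-E(\mathbf{y},\mathbf{x})\big),
\qquad
Z(\mathbf{x}) \;=\; \int \exp\!\big(-E(\mathbf{y},\mathbf{x})\big)\,\mathrm{d}\mathbf{y}.
```

Because the depths y are continuous and the energy is quadratic in y, the Gaussian integral defining Z(x) has an analytical form, which is what makes exact maximization of the log-likelihood possible.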

SLIDE 39

Continuous CRF

  • Energy function (unary + pairwise):
  • Depth estimation (MAP inference):
SLIDE 40

Continuous CRF

  • Energy function (unary + pairwise):
  • Depth estimation (MAP inference):
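The slide's formulas are missing from the transcript; a hedged reconstruction from the DCNF paper, where z_p(x) is the CNN-regressed depth of superpixel p and R_pq(x) is a CNN-predicted pairwise similarity, is:

```latex
E(\mathbf{y},\mathbf{x}) \;=\;
  \sum_{p}\big(y_p - z_p(\mathbf{x})\big)^2
  \;+\; \sum_{(p,q)} \tfrac{1}{2}\,R_{pq}(\mathbf{x})\,(y_p - y_q)^2,
\qquad
\mathbf{y}^\ast \;=\; \operatorname*{argmax}_{\mathbf{y}}\; P(\mathbf{y}\mid\mathbf{x}).
```

Because E is quadratic in y, the MAP solution y* is obtained in closed form by solving a linear system.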
SLIDE 41

Potential functions

  • Pairwise potential
SLIDE 42

Learning

  • Minimize the negative conditional log-likelihood

– Optimization: back propagation

SLIDE 43

Depth prediction

  • MAP inference
SLIDE 44

Model speedup using fully-conv networks and superpixel pooling

  • Problem with the DCNF model: inefficient

– Need to perform convolutions for each superpixel image patch

  • Model speedup: DCNF-FCSP

– Deep convolutional neural fields with fully convolutional networks and superpixel pooling

– Only need to perform convolutions over the entire image once

SLIDE 45

DCNF-FCSP

  • Fully-conv nets → convolution maps
  • Superpixel pooling → convolution features of superpixels
SLIDE 46

Fully-conv nets

  • Perform convolutions over the input image to obtain convolution maps
  • Deeper networks, e.g., VGG-16, GoogLeNet, can be used

SLIDE 47

Superpixel pooling

  • Average pooling within superpixels
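Average pooling of convolution features within superpixels can be sketched as follows; the array shapes, example values, and dictionary output format are illustrative assumptions:

```python
import numpy as np

def superpixel_avg_pool(conv_map, seg):
    """Average-pool convolution features within each superpixel.

    conv_map: (H, W, C) convolution maps from the fully-conv net.
    seg:      (H, W) integer superpixel labels.
    Returns {superpixel id: C-dimensional mean feature vector}.
    """
    return {int(s): conv_map[seg == s].mean(axis=0) for s in np.unique(seg)}

# Tiny 2x2 image with 3 feature channels and 2 superpixels.
conv_map = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
seg = np.array([[0, 0],
                [1, 1]])
feats = superpixel_avg_pool(conv_map, seg)
```

Convolutions run once over the whole image; only this cheap pooling step is per-superpixel, which is the source of the speedup.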
SLIDE 48

Baseline comparison

  • NYU v2
SLIDE 49

DCNF vs. DCNF-FCSP

  • Training time comparison
SLIDE 50

State-of-the-art comparison

  • NYU v2
SLIDE 51

State-of-the-art comparison

  • Make3D
SLIDE 52

Prediction examples: NYU v2

SLIDE 53

Prediction examples: Make3D

SLIDE 54

Conclusion

  • Deep convolutional neural fields for monocular image depth estimation

  • Combine deep CNN and continuous CRF
  • General learning framework

Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields Fayao Liu, Chunhua Shen, Guosheng Lin CVPR2015 http://arxiv.org/abs/1502.07411

SLIDE 55

Deep structured model

  • CNNs+CRFs

– formulate CNN based potential functions in CRFs

  • Convolutional Neural Networks (CNNs): feature learning
  • Conditional random fields (CRFs): model relations
  • Application on semantic segmentation

– capture the contextual information to improve performance

SLIDE 56

CRFs+CNNs

Conditional likelihood and energy function: the CNN-based (log-) potential function (factor function) can be a unary, pairwise, or higher-order potential.

  • CNN-based unary potential: measures the labelling confidence of a single variable
  • CNN-based pairwise potential: measures the confidence of the pairwise label configuration (variables y1, y2)

Factor graph: a factorization of the joint distribution of variables.
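The conditional likelihood and energy function shown (as images) on this slide can be sketched as follows; the exact parameterisation is an assumption based on the slide's description:

```latex
P(\mathbf{y}\mid\mathbf{x};\boldsymbol{\theta}) \;=\;
  \frac{1}{Z(\mathbf{x};\boldsymbol{\theta})}
  \exp\!\big(-E(\mathbf{y},\mathbf{x};\boldsymbol{\theta})\big),
\qquad
E(\mathbf{y},\mathbf{x};\boldsymbol{\theta}) \;=\;
  \sum_{p} U(y_p,\mathbf{x};\boldsymbol{\theta}_U)
  \;+\; \sum_{(p,q)} V(y_p,y_q,\mathbf{x};\boldsymbol{\theta}_V),
```

where the unary potential U and pairwise potential V are outputs of CNNs with parameters θ_U and θ_V respectively.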

SLIDE 57

Challenges in Learning CRFs+CNNs

Prediction can be made by marginal inference (e.g., message passing). CRF-CNN joint learning: learn the CNN potential functions by optimizing the CRF objective, typically minimizing the negative conditional log-likelihood (NLL), with the CNN parameters learned by stochastic gradient descent.

The partition function Z brings difficulties for optimization: each SGD iteration requires approximate marginal inference to calculate the factor marginals. CNN training needs a large number of SGD iterations, so training becomes intractable.

SLIDE 58
Solutions

  • Traditional approach:

– Applying approximate learning objectives: replace the optimization objective to avoid inference (e.g., piecewise training, pseudo-likelihood)
– Aims to learn the potential functions and then perform inference for the final prediction

  • Our approach:

– Directly target the final prediction
– Not learning the potential functions; instead, learn CNN estimators to directly output the required intermediate values in an inference algorithm
– Focus on message-passing based inference for prediction (specifically loopy BP)
– Directly learn CNNs to predict the messages

SLIDE 59

Piecewise training of CRF

A variable is conditioned only on the neighbours within a single factor.
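As a sketch (the slide's formula is not preserved in the transcript), piecewise training replaces the globally-normalised likelihood with a product of locally-normalised per-factor likelihoods:

```latex
P(\mathbf{y}\mid\mathbf{x}) \;\approx\; \prod_{F} P_F(\mathbf{y}_F\mid\mathbf{x}),
\qquad
P_F(\mathbf{y}_F\mid\mathbf{x}) \;=\;
  \frac{\exp\!\big(-E_F(\mathbf{y}_F,\mathbf{x})\big)}
       {\sum_{\mathbf{y}'_F}\exp\!\big(-E_F(\mathbf{y}'_F,\mathbf{x})\big)}.
```

Each factor normalises only over its own variables' labels, so no global (approximate) inference is needed inside the SGD loop.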

SLIDE 60

[Figure: One Unary/Pairwise Potential Network. One input image is resized into an image pyramid (scale levels 1, 2, 3); each scale passes through Network Part-1 (6 conv blocks) to produce feature maps, which are upsampled and concatenated into a combined feature map; features for one node/edge then pass through Network Part-2 (2 FC layers).]

  • Fig. 4 – An illustration of the details of one unary or one pairwise potential network. An input image is first resized into 3 scales; each resized image goes through 6 convolution blocks to output 3 feature maps. A CRF graph is then constructed, and node or edge features are generated from the feature maps. Node or edge features go through a network to generate the unary or pairwise potential network outputs. Finally, the network outputs are fed into a CRF loss function in the training stage, or an MAP inference objective for prediction.

Network Part-1:
  Conv block 1: 3 x 3 conv 64; 3 x 3 conv 64; 2 x 2 pooling
  Conv block 2: 3 x 3 conv 128; 3 x 3 conv 128; 2 x 2 pooling
  Conv block 3: 3 x 3 conv 256; 3 x 3 conv 256; 3 x 3 conv 256; 2 x 2 pooling
  Conv block 4: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
  Conv block 5: 3 x 3 conv 512; 3 x 3 conv 512; 3 x 3 conv 512; 2 x 2 pooling
  Conv block 6: 7 x 7 conv 4096; 3 x 3 conv 512; 3 x 3 conv 512

Network Part-2:
  2 fully-connected layers: FC 512; FC 21 (unary) or FC 441 (pairwise)

  • Fig. 5 – The detailed configuration of the networks. “Network Part-1” and “Network Part-2” are described in Fig. 4. Here “Network Part-2” contains two layers, with the number of output units equal to the number of classes K (unary) or K^2 (pairwise).

Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation, arXiv 2015

SLIDE 61

belief propagation: message passing based inference

A simple example of marginal inference on the node y2 (chain y1 - y2 - y3):

  • Message: a K-dimensional vector, where K is the number of classes (node states)
  • Variable-to-factor message
  • Factor-to-variable message
  • Marginal distribution (beliefs) of one variable: product of its incoming factor-to-variable messages
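The sum-product computation on this three-node chain can be sketched in NumPy; the potential tables below are made-up illustrative numbers, and on a tree (as here) belief propagation recovers the exact marginal:

```python
import numpy as np

# Chain y1 -- f12 -- y2 -- f23 -- y3, each variable has K = 2 states.
phi1 = np.array([0.6, 0.4])                 # unary potential of y1
phi2 = np.array([0.5, 0.5])                 # unary potential of y2
phi3 = np.array([0.3, 0.7])                 # unary potential of y3
psi12 = np.array([[0.9, 0.1], [0.1, 0.9]])  # pairwise factor f12
psi23 = np.array([[0.8, 0.2], [0.2, 0.8]])  # pairwise factor f23

# Leaf variable-to-factor messages are just the unaries; each
# factor-to-variable message then sums out the other variable.
m_f12_to_y2 = psi12.T @ phi1   # sum over y1: K-dimensional vector
m_f23_to_y2 = psi23 @ phi3     # sum over y3: K-dimensional vector

# Belief (marginal) of y2: unary times incoming messages, normalised.
belief_y2 = phi2 * m_f12_to_y2 * m_f23_to_y2
belief_y2 /= belief_y2.sum()

# Brute-force check against the exact marginal.
joint = np.einsum('i,j,k,ij,jk->ijk', phi1, phi2, phi3, psi12, psi23)
exact = joint.sum(axis=(0, 2))
exact /= exact.sum()
assert np.allclose(belief_y2, exact)
```

On loopy graphs these same updates are iterated (loopy BP) and the beliefs become approximate marginals.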

SLIDE 62

CNN message estimators

  • Directly learn a CNN function to output the message vector

– Don't need to learn the potential functions

The factor-to-variable message: a message prediction function formulated by a CNN. The dependent message feature vector encodes all dependent messages from the neighbouring nodes that are connected to node p by the factor F, together with the input image region.

SLIDE 63

Learning CNN message estimator

  • Define the cross-entropy loss between the ideal marginal and the estimated marginal
  • The optimization problem for learning
  • The variable marginals estimated by the CNN
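A minimal sketch of this loss; the softmax parameterisation of the estimated marginal and the example scores are assumptions for illustration:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(ideal, estimated, eps=1e-12):
    """Cross entropy between the ideal marginal (ground-truth label
    as a delta distribution) and the estimated marginal."""
    return -np.sum(ideal * np.log(estimated + eps))

# Hypothetical K = 3 class scores from the CNN message estimator
# for one node; the estimated marginal is their softmax.
scores = np.array([2.0, 0.5, -1.0])
est_marginal = softmax(scores)
ideal = np.array([1.0, 0.0, 0.0])  # ground-truth label of the node
loss = cross_entropy(ideal, est_marginal)
```

Minimising this loss over all nodes trains the CNN to output messages whose resulting beliefs match the ground-truth labels.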

SLIDE 64

Application on semantic segmentation

  • Semantic segmentation: pixel labelling
  • CNNs+CRFs: capture contextual information to improve the label

prediction of one image patch

Pipeline: Network C1 (multi-scale FCNNs) → feature map (d channels) → create the CRF graph (create nodes and pairwise connections)

SLIDE 65

Create nodes in the CRF graph: one node corresponds to one spatial position in the feature map.

Generate pairwise connections: one node connects to the nodes that lie within a spatial range box (box with the dashed lines).

SLIDE 66

Pipeline: Network C1 (multi-scale FCNNs) → feature map → create the CRF graph (nodes and pairwise connections) → perform message passing inference with the message estimator → Network C2 (fully-connected layers) predicts all factor-to-variable messages → calculate the marginals on all nodes → prediction

Networks C1 and C2 need to be learned.

SLIDE 67

Segmentation examples

SLIDE 68

SLIDE 69

video

SLIDE 70

Leaderboard

Deeply Learning the Messages in Message Passing Inference Guosheng Lin, Chunhua Shen, Ian Reid, Anton van den Hengel NIPS 2015. http://arxiv.org/abs/1506.02108

SLIDE 71

Conclusions

I am a big fan of Deep Learning + Big Data

  • GPU hungry: tens of K40’s, Titan, K80’s

Dense (per-pixel) prediction: computer vision research has been shifting from per-image prediction to per-pixel prediction, so structured learning plays an increasingly important role.