ION: Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks - PowerPoint PPT Presentation



SLIDE 1

INSIDE-OUTSIDE NET: DETECTING OBJECTS IN CONTEXT WITH SKIP POOLING AND RECURRENT NEURAL NETWORKS

Sean Bell (Cornell University)
Kavita Bala (Cornell University)
Larry Zitnick (Microsoft Research, now at FAIR)
Ross Girshick (Microsoft Research, now at FAIR)

SLIDE 2

ION TEAM

  • Sean Bell (Cornell University)
  • Ross Girshick and Larry Zitnick (Microsoft Research, now both at FAIR)
  • Kavita Bala (Cornell University)

SLIDE 3

SUMMARY: MS COCO DETECTION

Best Student Entry (3rd Place Overall)

                   test-competition   test-dev   Runtime
Competition        31.0%              31.2%      2.7 s
Post-Competition   -                  33.1%      5.5 s

Key pieces:

  • New ION detector (+5.1 mAP)
  • Better proposals, more data (+3.9 mAP)
  • Better training/testing (+4.1 mAP)

(single ConvNet model, no ensembling)

Tech report: http://arxiv.org/pdf/1512.04143.pdf

SLIDE 4

ION DETECTOR

+5.1 mAP on COCO test-dev compared to Fast R-CNN

SLIDE 5

FAST R-CNN [GIRSHICK 2015]

Input → ConvNet (conv5) → “ROI Pooling” → fc6 → fc7 → cls, bbox
(diagram labels: Feature extraction, Classification)

Can we improve on feature extraction?

  • For small objects, the footprint on conv5 might only cover a 1x1 cell, which gets upsampled to 7x7
  • Only local features (inside the ROI) are used for classification
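The first bullet can be checked with a little arithmetic. The sketch below assumes VGG16 conv5 features with a stride of 16 input pixels per cell; the helper `conv5_footprint` is a hypothetical name, and the floor-based mapping is a simplification rather than the exact ROI pooling rounding used by Fast R-CNN.

```python
import math

CONV5_STRIDE = 16  # assumption: VGG16 conv5, 16 input pixels per feature cell


def conv5_footprint(x1, y1, x2, y2, stride=CONV5_STRIDE):
    """Return (cells_wide, cells_high) covered by the pixel box [x1, x2) x [y1, y2)."""
    # Map the first and last covered pixel of each axis to a cell index.
    cx1, cy1 = math.floor(x1 / stride), math.floor(y1 / stride)
    cx2, cy2 = math.floor((x2 - 1) / stride), math.floor((y2 - 1) / stride)
    return cx2 - cx1 + 1, cy2 - cy1 + 1


small = conv5_footprint(32, 48, 48, 64)   # a 16x16-pixel object
large = conv5_footprint(0, 0, 224, 224)   # a 224x224-pixel object
print(small, large)
```

Under these assumptions the 16x16-pixel object lands in a single conv5 cell, so the 7x7 ROI-pooled grid for it repeats one feature vector 49 times.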

SLIDE 6

LET’S ADD SKIP CONNECTIONS

conv3, conv4, conv5 → concatenate → dim reduction → fc6 → fc7 → cls, bbox
(diagram labels: Feature extraction, Classification)

[Sermanet 2013], [Hariharan 2015], [Liu 2015]

SLIDE 7

PROBLEM: FEATURE AMPLITUDE

  • Different layers have very different amplitudes [Liu 2015]
  • We must account for this to combine features
  • L2 normalize to length 1, and then re-scale

SLIDE 8

COMBINING ACROSS LAYERS

conv3, conv4, conv5 → normalize, concatenate, re-scale → fc6 → fc7 → cls, bbox
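The normalize, concatenate, re-scale step could look roughly like the NumPy sketch below. Several details are assumptions rather than facts from the slides: normalization is done per spatial location across channels, the scale is a fixed hypothetical constant (`scale=10.0`), and the random tensors merely stand in for 7x7 ROI-pooled features with VGG16 channel counts.

```python
import numpy as np


def l2_normalize(features, eps=1e-12):
    """Scale each spatial location's channel vector to unit L2 length.

    features: array of shape (channels, height, width).
    """
    norm = np.sqrt((features ** 2).sum(axis=0, keepdims=True))
    return features / (norm + eps)


def combine_layers(pooled, scale=10.0):
    """Normalize each layer's features, concatenate on channels, then re-scale."""
    normalized = [l2_normalize(f) for f in pooled]
    return scale * np.concatenate(normalized, axis=0)


rng = np.random.default_rng(0)
# Stand-in ROI-pooled features with deliberately different amplitudes.
conv3 = 100.0 * rng.random((256, 7, 7))
conv4 = 10.0 * rng.random((512, 7, 7))
conv5 = 1.0 * rng.random((512, 7, 7))

combined = combine_layers([conv3, conv4, conv5])
print(combined.shape)
```

After this step every layer contributes vectors of the same length at each location, so no single layer dominates the concatenated feature simply because its raw activations are larger.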

SLIDE 9

RESCALING FEATURE AMPLITUDES

VOC2007 mAP:

Layers pooled    Naive                        With L2 norm + rescaling
conv5            70.8 (same as Fast R-CNN)    71.5
conv5+4          69.7                         74.4
conv5+4+3        63.6                         74.6
conv5+4+3+2      49.3                         74.6

SLIDE 10

ION: INSIDE-OUTSIDE NET

Base ConvNet: VGG16 [Simonyan 2014]

SLIDE 11

ION: INSIDE-OUTSIDE NET

Base ConvNet: VGG16 [Simonyan 2014]

SLIDE 12

LATERAL RNN (MOVES ACROSS AN IMAGE)

Diagram labels: conv5 (convolutional features); hidden state; output (which we interpret as context features)

  • Repeat for each row
  • Can compute each column in parallel
  • We can also move in 4 different directions …

[Schuster 1997], [Graves 2009], [Byeon 2015], [Visin 2015]

SLIDE 13

RNN IMPLEMENTATION

Diagram: conv5 is swept in four directions (right, left, up, down)

SLIDE 14

RNN IMPLEMENTATION

Diagram: conv5 is swept in four directions (right, left, up, down)

Abstract away the complexity: transpose everything to left-to-right and write a single GPU implementation
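The transposition idea can be illustrated on the CPU with NumPy. This is only a sketch of the data movement: `sweep_right` is a stand-in recurrence (a running sum) rather than the actual RNN, and both function names are hypothetical.

```python
import numpy as np


def sweep_right(x):
    """Stand-in left-to-right recurrence over the last axis: h[t] = h[t-1] + x[t].

    x has shape (channels, height, width); every row is swept independently.
    """
    h = np.zeros_like(x)
    running = np.zeros(x.shape[:2], dtype=x.dtype)
    for t in range(x.shape[2]):
        running = running + x[:, :, t]   # one whole column is updated at once
        h[:, :, t] = running
    return h


def sweep(x, direction):
    """Run sweep_right in any of four directions by flipping or transposing."""
    if direction == "right":
        return sweep_right(x)
    if direction == "left":
        return sweep_right(x[:, :, ::-1])[:, :, ::-1]
    if direction == "down":
        return sweep_right(x.transpose(0, 2, 1)).transpose(0, 2, 1)
    if direction == "up":
        return sweep(x[:, ::-1, :], "down")[:, ::-1, :]
    raise ValueError(f"unknown direction: {direction}")


x = np.ones((1, 2, 3))
for d in ("right", "left", "down", "up"):
    print(d, sweep(x, d)[0].tolist())
```

Only `sweep_right` contains a loop over positions, so an optimized kernel would need to be written once and reused for the other three directions.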

SLIDE 15

RNN IMPLEMENTATION

ReLU RNN: “IRNN” [Le 2015]
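A minimal NumPy sketch of one left-to-right sweep of a ReLU RNN follows. It assumes the IRNN recipe of initializing the recurrent matrix to the identity, omits the bias term, and treats the input as already-computed input-to-hidden activations; the function name `irnn_sweep_right` is hypothetical.

```python
import numpy as np


def irnn_sweep_right(inputs, w_hh=None):
    """ReLU RNN moving left to right across a feature map.

    inputs: (channels, height, width) input-to-hidden activations.
    w_hh:   (channels, channels) recurrent weights; identity by default,
            which is the IRNN initialization.
    """
    channels, height, width = inputs.shape
    if w_hh is None:
        w_hh = np.eye(channels)
    hidden = np.zeros((channels, height))
    outputs = np.empty_like(inputs, dtype=float)
    for x in range(width):
        # All rows of column x are updated in one matrix product.
        hidden = np.maximum(w_hh @ hidden + inputs[:, :, x], 0.0)
        outputs[:, :, x] = hidden
    return outputs


feat = np.array([[[1.0, -3.0, 1.0, 1.0]]])  # 1 channel, 1 row, 4 columns
print(irnn_sweep_right(feat)[0, 0])
```

With identity recurrent weights the hidden state accumulates the inputs seen so far along the row, and the ReLU resets it to zero whenever the running value goes negative.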

SLIDE 16

RNN IMPLEMENTATION

Merge the hidden-to-output into a single conv.

SLIDE 17

RNN IMPLEMENTATION

Share the input-to-hidden transition

SLIDE 18

RNN IMPLEMENTATION

Our final architecture: stack 2 RNNs together

Features used by our detector

SLIDE 19

RNN: SPATIAL DEPENDENCY

SLIDE 20

ION: INSIDE-OUTSIDE NET

Base ConvNet: VGG16 [Simonyan 2014]

Main changes:

  • Inside: Skip connections with L2 normalization
  • Outside: Stacked 4-direction RNNs for context

SLIDE 21

BETTER PROPOSALS, MORE DATA

+3.9 mAP on COCO test-dev, compared to Selective Search

SLIDE 22

REGION PROPOSAL NETWORK (RPN)

Faster R-CNN [Ren 2015]

SLIDE 23

REGION PROPOSAL NETWORK (RPN)

  • Original RPN [Ren 2015] used 9 anchors: 3 scales x 3 aspect ratios. RPN works well for VOC, but not COCO
  • We extend this to 22 anchors: 7 scales x 3 aspect ratios, plus one 32x32 anchor
  • Avg. Recall:

    Method                              Avg. Recall
    Selective Search [Uijlings 2013]    41.7%
    MCG [Arbelaez 2014]                 51.6%
    RPN with 10 anchors [Ren 2015]      39.9%
    RPN with 22 anchors                 44.1%

  • We mix MCG with RPN, which performs better than either individually (1000 of each for training, 2000 of each for testing)
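The anchor counts can be reproduced with a short sketch. The slide gives only the counts, so the seven scale values and the aspect ratios below are hypothetical placeholders; the 3-scale call uses the 128/256/512 scales of the original Faster R-CNN anchors.

```python
import math

# Hypothetical values: the slide states the counts, not the actual sizes.
SCALES = [64, 96, 128, 192, 256, 384, 512]   # 7 scales (side of an equal-area square)
RATIOS = [0.5, 1.0, 2.0]                      # 3 aspect ratios (height / width)
EXTRA = [(32.0, 32.0)]                        # the additional 32x32 anchor


def make_anchors(scales=SCALES, ratios=RATIOS, extra=EXTRA):
    """Return a list of (width, height) anchor shapes."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # keep the area at s*s while changing shape
            h = s * math.sqrt(r)
            anchors.append((w, h))
    anchors.extend(extra)
    return anchors


anchors = make_anchors()
original = make_anchors(scales=[128, 256, 512], extra=[])
print(len(anchors), len(original))
```

The extra small anchor and the wider range of scales are aimed at COCO, where objects vary far more in size than in VOC.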

SLIDE 24

BETTER TRAINING/TESTING

+4.1 mAP on COCO test-dev, compared to Fast R-CNN setup

SLIDE 25

TRAINING IMPROVEMENTS

  • No dropout (+0.6 mAP)
  • Train for longer with larger mini-batches: 4 images (512 ROIs total) per batch (+0.8 mAP)
  • Regularize with semantic segmentation predictions (+1.3 mAP) (see tech report)

(mAP on test-dev)

SLIDE 26

TESTING IMPROVEMENTS

  • We use iterative box regression and weighted voting, from MR-CNN [Gidaris 2015]
  • Helps on PASCAL (+2.0 mAP)
  • Reduces score on COCO (-0.5 mAP), since COCO requires precise localization
  • New thresholds: NMS: ~0.45, voting: ~0.85 (+1.3 mAP)
  • Left-right flips: evaluate on original and flipped image and average (+0.8 mAP)

[Gidaris 2015]
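Weighted box voting after NMS can be sketched as below, using the two thresholds quoted on the slide as defaults. This is an illustrative NumPy version, not the MR-CNN or ION code: it assumes positive scores, boxes given as [x1, y1, x2, y2], and uses hypothetical function names.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms_with_voting(boxes, scores, nms_thresh=0.45, vote_thresh=0.85):
    """Greedy NMS, then refine each kept box by score-weighted voting."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou_one_to_many(boxes[best], boxes[order])
        order = order[overlaps <= nms_thresh]   # also drops `best` itself
    voted = []
    for k in keep:
        # Every box (kept or suppressed) that overlaps enough gets a vote.
        voters = iou_one_to_many(boxes[k], boxes) >= vote_thresh
        weights = scores[voters]
        voted.append((weights[:, None] * boxes[voters]).sum(axis=0) / weights.sum())
    return np.array(voted), scores[keep]


boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.6, 0.8])
voted_boxes, voted_scores = nms_with_voting(boxes, scores)
print(voted_boxes, voted_scores)
```

A high voting threshold such as 0.85 means only near-duplicates of a kept box influence its final coordinates, which fits the slide's remark that COCO rewards precise localization.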

SLIDE 27

COMPARISON TO RESNET (WINNER) [HE 2015]

Our single-model (post-competition) result: 33.1% mAP

ResNet101 and ION are potentially complementary, so combining them may help

SLIDE 28

CONCLUSION

Improvement breakdown:

  • New ION detector (+5.1 mAP)
  • Better proposals, more data (+3.9 mAP)
  • Better training/testing (+4.1 mAP)

Tech Report: http://arxiv.org/pdf/1512.04143.pdf

Thanks:

  • NVIDIA (GPU Donation)
  • Microsoft Research (Internship)

Sean Bell, Kavita Bala, Larry Zitnick, Ross Girshick

SLIDE 29

EXTRA SLIDES

SLIDE 30

SURPRISING FIND: H2H TRANSITION NOT NEEDED

We use H2H (the hidden-to-hidden transition) for our submission, but there is barely any drop without it!

SLIDE 31

WHAT ABOUT OTHER CONTEXT METHODS?

SLIDE 32

IS SEGMENTATION LOSS WORTH IT?

  • Test: +1 mAP, same speed
  • Train: 1.5x-2x slower

SLIDE 33

HOW MANY RNN LAYERS?

SLIDE 34

WHY CONV3, CONV4, CONV5?

SLIDE 35

RESULTS (VOC 2007 TEST)

Method                                          mAP
Fast R-CNN [Girshick 2015]                      70.0
Faster R-CNN [Ren 2015]                         73.2
conv3 + conv4 + conv5                           75.6
+ RNN + segmentation loss                       76.5
+ second bbox regression + weighted voting      78.5
- dropout                                       79.2

SLIDE 36

ACTIVATIONS

Input Positive Negative

SLIDE 37

ACTIVATIONS

Input Positive Negative

SLIDE 38

RNN ACTIVATIONS

Input Positive Negative

SLIDE 39

RNN ACTIVATIONS

Input Positive Negative

SLIDE 40

RNN ACTIVATIONS

Input Positive Negative

SLIDE 41

RECURRENT NEURAL NETWORKS

[Karpathy 2015]

SLIDE 42

TYPES OF RNNS

  • “Vanilla” RNN (tanh)
  • LSTM (Long Short-Term Memory)
  • GRU (Gated Recurrent Unit)

[Rumelhart 1986], [Hochreiter and Schmidhuber 1997], [Cho 2014]

SLIDE 43

CAN WE USE RELU WITH AN RNN?

  • Replacing tanh with ReLU gave huge gains for AlexNet [Krizhevsky 2012]
  • Is there some way to use ReLU with RNNs?

(Figure: tanh vs. ReLU activation; “Vanilla” RNN (tanh))

SLIDE 44