SLIDE 1

Structural Priors in Deep Neural Networks

YANI IOANNOU, MAR. 12TH 2018

SLIDE 2

About Me

  • Yani Ioannou (yu-an-nu)
  • Ph.D. Student, University of Cambridge
  • Dept. of Engineering, Machine Intelligence Lab
  • Prof. Roberto Cipolla, Dr. Antonio Criminisi (MSR)
  • Research scientist at Wayve
  • Self-driving car start-up in Cambridge
  • Have lived in 4 countries (Canada, UK, Cyprus and Japan)


SLIDE 3

Research Background

  • M.Sc. Computing, Queen’s University
  • Prof. Michael Greenspan
  • 3D Computer Vision
  • Segmentation and recognition in massive unorganized point clouds of urban environments
  • “Difference of Normals” multi-scale operator (published at 3DIMPVT)

SLIDE 4

Research Background

  • Ph.D. Engineering, University of Cambridge (2014 - 2018)
  • Prof. Roberto Cipolla, Dr. Antonio Criminisi (Microsoft Research)
  • Microsoft PhD Scholarship, 9-month internship at Microsoft Research
[Figure: two images, captioned c. 1496 and c. 2012]
SLIDE 5

Ph.D. – Collaborative Work

  • Segmentation of brain tumour tissues with CNNs
    D. Zikic, Y. Ioannou, M. Brown, A. Criminisi (MICCAI-BRATS 2014)
    One of the first papers using deep learning for volumetric/medical imagery
  • Using CNNs for Malaria Diagnosis (Intellectual Ventures / Gates Foundation)
    Designed a CNN for the classification of malaria parasites in blood smears
  • Measuring Neural Net Robustness with Constraints
    O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. Nori, A. Criminisi (NIPS 2016)
    Found that not all adversarial images can be used to improve network robustness
  • Refining Architectures of Deep Convolutional Neural Networks
    S. Shankar, D. Robertson, Y. Ioannou, A. Criminisi, R. Cipolla (CVPR 2016)
    Proposed a method for adapting neural network architectures to new datasets
SLIDE 6

Ph.D. – First Author

  • Thesis: “Structural Priors in Deep Neural Networks”
  • Training CNNs with Low-Rank Filters for Efficient Image Classification
    Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, Antonio Criminisi (ICLR 2016)
  • Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups
    Yani Ioannou, Duncan Robertson, Roberto Cipolla, Antonio Criminisi (CVPR 2017)
  • Decision Forests, Convolutional Networks and the Models In-Between
    Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, A. Criminisi (Microsoft Research Tech. Report, 2015)

SLIDE 7

Motivation

  • Deep Neural Networks are massive!
  • AlexNet1 (2012)
  • 61 million parameters
  • 724 million FLOPs
  • Most compute in conv. layers
1 Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” 2 He, Zhang, Ren, and Sun, “Deep Residual Learning for Image Recognition”
SLIDE 8

Motivation

  • Deep Neural Networks are massive!
  • AlexNet1 (2012)
  • 61 million parameters
  • 724 million FLOPs
  • 96% of parameters in F.C. layers!
1 Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” 2 He, Zhang, Ren, and Sun, “Deep Residual Learning for Image Recognition”
SLIDE 9

Motivation

  • Deep Neural Networks are massive!
  • AlexNet1 (2012)
  • 61 million parameters
  • 7.24×10⁸ FLOPs
  • ResNet² 200 (2015)
  • 62.5 million parameters
  • 5.65×10¹² FLOPs
  • 2-3 weeks of training on 8 GPUs
1 Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” 2 He, Zhang, Ren, and Sun, “Deep Residual Learning for Image Recognition”
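
To make the parameter/compute split concrete, here is a back-of-the-envelope Python sketch (my own illustrative helper functions, with sizes roughly matching AlexNet’s conv1 and fc6) counting parameters and multiply-accumulates for a convolutional layer versus a fully connected layer:

```python
def conv_stats(h_out, w_out, k, c_in, c_out):
    """Parameters and multiply-accumulates (MACs) for a conv layer."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out          # each filter is applied at every output position
    return params, macs

def fc_stats(n_in, n_out):
    """Parameters and multiply-accumulates for a fully connected layer."""
    params = n_in * n_out
    return params, params                  # each weight is used exactly once

# Illustrative layers, roughly AlexNet-like (biases ignored):
print(conv_stats(55, 55, 11, 3, 96))       # conv1: ~35K params, ~105M MACs
print(fc_stats(6 * 6 * 256, 4096))         # fc6:   ~38M params, ~38M MACs
```

The weight sharing of convolution means few parameters but many operations, while a fully connected layer uses each of its many parameters only once, which is why most parameters sit in the F.C. layers but most compute sits in the conv layers.
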
SLIDE 10

Motivation

  • Until very recently, state-of-the-art DNNs for ImageNet were only getting more computationally complex
  • Each generation increased in depth and width
  • Is it necessary to increase complexity to improve generalization?

[Figure: top-5 error vs. log10(multiply-accumulate operations) for AlexNet, VGG-11/13/16/19, GoogLeNet (1x/10x/144x), MSRA-A/B/C, ResNet-50 and pre-ResNet-200, comparing crop & mirror augmentation with extra augmentation.]

SLIDE 11

Over-parameterization of DNNs

  • There are many proposed methods for improving the test-time efficiency of DNNs, showing that trained DNNs are over-parameterized:
  • Compression
  • Pruning
  • Reduced Representation
SLIDE 12

Structural Prior

Incorporating our prior knowledge of the problem and its representation into the connective structure of a neural network

  • Optimization of neural networks needs to learn what weights not to use
  • This is usually achieved with regularization
  • Can we structure networks closer to the specialized components used for learning from images, using our prior knowledge of the problem and its representation?
  • Structural Priors ⊂ Network Architecture
  • Architecture is a more general term, e.g. number of layers, activation functions, pooling, etc.
SLIDE 13

Regularization

  • Regularization does help training, but is not a substitute for good structural priors
  • MacKay (1991): regularization is not enough to make an over-parameterized network generalize as well as a network with a more appropriate parameterization
  • We liken regularization to a weak structural prior
  • Used where our only prior knowledge is that our network is greatly over-parameterized
SLIDE 14

Rethinking Regularization

  • “Understanding deep learning requires rethinking generalization”, Zhang et al., 2016
  • “Deep neural networks easily fit random labels.”
  • Identifies types of “regularization”:
  • “Explicit regularization” – e.g. weight decay, dropout and data augmentation
  • “Implicit regularization” – e.g. early stopping, batch normalization
  • “Network architecture”
  • Explicit regularization has little effect on fitting random labels, while implicit regularization and network architecture do
  • Highlights the importance of network architecture, and by extension structural priors, for good generalization

SLIDE 15

Convolutional Neural Networks

Prior Knowledge for Natural Images:

  • Local correlations are very important → convolutional filters
  • We don’t need to learn a different filter for each pixel → shared weights
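
A rough illustration of how much these two priors buy (illustrative sizes, not from the slides): connecting a 32 × 32 × 3 input to a 32 × 32 × 16 output densely, with local connectivity only, and with local connectivity plus shared weights (a convolution):

```python
H, W, c1, c2, k = 32, 32, 3, 16, 3

dense_params = (H * W * c1) * (H * W * c2)   # every input unit connected to every output unit
local_params = (k * k * c1) * (H * W * c2)   # local connectivity, but a separate filter per position
conv_params  = (k * k * c1) * c2             # local connectivity + shared weights

print(dense_params)  # 50,331,648
print(local_params)  # 442,368
print(conv_params)   # 432
```
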
SLIDE 16

Convolutional Neural Networks

Structural Prior for Natural Images

[Figure: a convolutional layer – an H × W × c1 input image/feature map is convolved with c2 filters (the parameters) of size h1 × w1 × c1, followed by a ReLU, giving an H × W × c2 output.]
SLIDE 17

Convolutional Neural Networks

Structural Prior for Natural Images

Fully connected layer: no kernel – every input pixel is connected to every output pixel, each with its own weight.
Convolutional layer: a 3 × 3 kernel – each output pixel is connected only to a local 3 × 3 neighbourhood of input pixels, with shared weights.

[Figure: (a) fully connected vs. (b) convolutional connectivity between a zero-padded 3 × 4 input image and a 4 × 3 output feature map.]
SLIDE 18

Ph.D. Thesis Outline

My thesis is based on three novel contributions, each exploring a separate aspect of structural priors in DNNs:

  • I. Spatial Connectivity
  • II. Inter-Filter Connectivity
  • III. Conditional Connectivity
SLIDE 19

Spatial Connectivity

SLIDE 20

Spatial Connectivity

Prior Knowledge:

  • Many of the filters learned in CNNs appear to be representing vertical/horizontal edges/relationships
  • Many others appear to be representable by combinations of low-rank filters
  • Previous work had shown that full-rank filters could be replaced with low-rank approximations, e.g. Jaderberg (2014)

Does every filter need to be square in a CNN?

SLIDE 21

Approximated Low-Rank Filters

Jaderberg, Max, Andrea Vedaldi, and Andrew Zisserman (2014) “Speeding up Convolutional Neural Networks with Low Rank Expansions”.

[Figure: a bank of c2 full w × h × c1 filters approximated by a smaller basis of d vertical (h × 1) filters followed by horizontal (1 × w) filters.]
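
A minimal NumPy/SciPy sketch of the underlying idea (just the single-channel, rank-1 case, not the paper’s full method): the best rank-1 approximation of a 3 × 3 filter, obtained by SVD, factors into a 3 × 1 vertical filter followed by a 1 × 3 horizontal filter, so two cheap convolutions reproduce the low-rank approximation of the full one.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))          # a single-channel 3x3 filter

# Best rank-1 approximation via SVD: W ≈ s1 * u1 @ v1^T
U, S, Vt = np.linalg.svd(W)
vertical = np.sqrt(S[0]) * U[:, :1]      # 3x1 filter
horizontal = np.sqrt(S[0]) * Vt[:1, :]   # 1x3 filter
W_rank1 = vertical @ horizontal

image = rng.standard_normal((32, 32))

# Convolving with the rank-1 filter equals convolving with the two
# separable filters in sequence (up to floating-point error).
full = convolve2d(image, W_rank1, mode="valid")
separable = convolve2d(convolve2d(image, vertical, mode="valid"),
                       horizontal, mode="valid")
print(np.allclose(full, separable))                      # True
print(np.linalg.norm(W - W_rank1) / np.linalg.norm(W))   # approximation error
```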

SLIDE 22

CNN with Low-Dimensional Embedding

Typical sub-architecture found in Network-in-Network, ResNet/Inception

[Figure: a low-dimensional embedding – an H × W × c1 input is convolved with c2 filters of size h1 × w1 × c1 (ReLU), then with c3 filters of size 1 × 1 × c2 (ReLU), giving an H × W × c3 output.]

SLIDE 23

Proposed: Low-Rank Basis

Same total number of filters on each layer as original network, but 50% are 1x3, and 50% are 3x1

[Figure: the low-rank basis layer – the c2 spatial filters are split between 1 × 3 and 3 × 1 filters (ReLU), followed by the c3 1 × 1 filters (ReLU) as before.]
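
A minimal PyTorch sketch of such a layer (a hypothetical LowRankBasisBlock, not the thesis code): half of the c2 spatial filters are 1 × 3, half are 3 × 1, and their outputs are concatenated before the usual 1 × 1 layer.

```python
import torch
import torch.nn as nn

class LowRankBasisBlock(nn.Module):
    """Hypothetical sketch: replace c2 full 3x3 filters with
    c2/2 1x3 filters and c2/2 3x1 filters, then a 1x1 layer."""
    def __init__(self, c1, c2, c3):
        super().__init__()
        self.h = nn.Conv2d(c1, c2 // 2, kernel_size=(1, 3), padding=(0, 1))
        self.v = nn.Conv2d(c1, c2 - c2 // 2, kernel_size=(3, 1), padding=(1, 0))
        self.relu = nn.ReLU(inplace=True)
        self.pointwise = nn.Conv2d(c2, c3, kernel_size=1)

    def forward(self, x):
        # Concatenate horizontal and vertical low-rank responses channel-wise.
        y = self.relu(torch.cat([self.h(x), self.v(x)], dim=1))
        return self.relu(self.pointwise(y))

block = LowRankBasisBlock(c1=64, c2=128, c3=128)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 32, 32])
```

Compared with c2 full 3 × 3 filters (9·c1·c2 weights), the two low-rank filter banks use only 3·c1·c2 weights.
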

SLIDE 24

Proposed Structural Prior: Low-Rank + Full Basis

25% of total filters are full 3x3

[Figure: the low-rank + full basis layer – the c2 spatial filters are a mix of 1 × 3, 3 × 1 and full 3 × 3 filters, followed by the c3 1 × 1 filters as before.]

SLIDE 25

Inception

Learning a Filter-Size Basis – learning many small filters (1x1, 3x3), and fewer of the larger (5x5, 7x7)

[Figure: an Inception module as a filter-size basis – many 1 × 1 and 3 × 3 filters, fewer 5 × 5 and 7 × 7 filters, concatenated into the c2 output channels before the c3 1 × 1 filters.]

SLIDE 26

ImageNet Results

  • gmp: vgg-11 with global max pooling
  • gmp-lr-2x: 60% less computation
  • gmp-lr-join-wfull: 16% less computation, 1% pt. lower error
SLIDE 27

Low-Rank Basis

Structural Prior for CNNs

[Figure: the low-rank basis layer, as above.]

VGG-11 on ILSVRC:
  • 21% fewer parameters, 41% less computation (low-rank only)
  • 1% pt. higher accuracy, 16% less computation (low/full-rank mix)

SLIDE 28

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens (Google Inc.), Zbigniew Wojna (University College London) – arXiv:1512.00567 [cs.CV], Dec 2015

[Embedded paper excerpt: abstract and Figures 6–7, showing Inception modules with n × n convolutions factorized into 1 × n and n × 1 convolutions.]

Inception v.3

Google’s Inception architecture (v.3 and higher) uses our low-rank filters!

SLIDE 29

Inter-Filter Connectivity

SLIDE 30

Inter-filter Connectivity

Prior Knowledge:

  • CNNs learn sparse, distributed representations
  • Most filters on adjacent layers have low correlation

Does every filter need to be connected to every other filter on a previous layer in a CNN?

[Figure: Network-in-Network conv1/conv2 filter covariance, alongside the conv1 and conv2 filters.]

SLIDE 31

AlexNet Filter Groups

  • AlexNet1 used model parallelization to fit in the GPU memory constraints of the time
  • “filter groups” used to split the network into two on all conv layers (except conv3)
1 Krizhevsky, Sutskever, and Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”
SLIDE 32

AlexNet Filter Groups

Convolutional filters in layers with g groups operate on only 1/g of the input channels: each group’s filters have shape h1 × w1 × (c1/g) and produce c2/g of the c2 output channels.

[Figure: a grouped convolutional layer with the c2 filters split into g groups, followed by ReLU.]

SLIDE 33

AlexNet Filter Groups

  • Filter groups reduce connectivity between filters, allowing easier model parallelization
  • Filter groups drastically reduce the number of parameters, and computation
  • … and they don’t seem to affect the generalization of AlexNet?!
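
A minimal PyTorch sketch of the parameter reduction from filter groups (channel sizes loosely modelled on AlexNet’s conv2; nn.Conv2d’s groups argument implements this connectivity):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# A conv layer with and without filter groups: 96 input channels, 256 filters of 5x5.
full    = nn.Conv2d(96, 256, kernel_size=5, padding=2)            # every filter sees all 96 channels
grouped = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)  # each filter sees only 96/2 = 48 channels

print(n_params(full))     # 96*256*5*5 + 256 = 614,656
print(n_params(grouped))  # 48*256*5*5 + 256 = 307,456  (~2x fewer weights)
```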
SLIDE 34

CNN with Low-Dimensional Embedding

Typical sub-architecture found in Network-in-Network, ResNet/Inception

[Figure: a low-dimensional embedding – c2 filters of size h1 × w1 × c1 (ReLU), then c3 filters of size 1 × 1 × c2 (ReLU), as above.]

SLIDE 35

Root-2 Module

Structural Prior for CNNs with Sparse Inter-Filter Relationships

[Figure: the root-2 module – the c2 spatial filters are split into g = 2 groups, each operating on c1/2 input channels (ReLU), followed by a full 1 × 1 convolution with c3 filters over all c2 channels (ReLU).]

SLIDE 36

Root-4 Module

Structural Prior for CNNs with Sparse Inter-Filter Relationships

[Figure: the root-4 module – as above, but with g = 4 filter groups in the spatial convolution.]
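
A minimal PyTorch sketch of a root module (a hypothetical RootModule helper, not the Deep Roots code): a grouped spatial convolution giving sparse inter-filter connectivity, followed by a full 1 × 1 convolution that re-mixes channels across the groups.

```python
import torch
import torch.nn as nn

class RootModule(nn.Module):
    """Hypothetical sketch of a root module: a grouped spatial conv
    (sparse inter-filter connectivity) followed by a full 1x1 conv
    that mixes channels across the groups."""
    def __init__(self, c1, c2, c3, groups, kernel_size=3):
        super().__init__()
        self.grouped = nn.Conv2d(c1, c2, kernel_size,
                                 padding=kernel_size // 2, groups=groups)
        self.pointwise = nn.Conv2d(c2, c3, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.pointwise(self.relu(self.grouped(x))))

root2 = RootModule(c1=64, c2=128, c3=128, groups=2)
root4 = RootModule(c1=64, c2=128, c3=128, groups=4)
for m in (root2, root4):
    print(sum(p.numel() for p in m.parameters()))
# groups=2: (64/2)*128*9 + 128 + 128*128 + 128 = 53,504
# groups=4: (64/4)*128*9 + 128 + 128*128 + 128 = 35,072
```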

SLIDE 37

Network in Network Filter Groups

  • Replace non-spatial convolutional layers with root modules
SLIDE 38

Filter Group Topologies

  • But how many groups to use? Should this change with depth?
  • We explored 3 basic topologies on CIFAR10:

  • Tree: increase # filter groups with depth
  • Root: decrease # filter groups with depth
  • Column: maintain a constant # of filter groups

[Figure: the tree, root and column topologies, each drawn from network input to output.]
SLIDE 39

CIFAR-10 Results

SLIDE 40

CIFAR-10 Results

SLIDE 41

Covariance

  • Block-diagonal sparsity induced by a root module is visible in the inter-layer correlation
SLIDE 42

ILSVRC12 Results – ResNet 50

  • root-16
  • 27% fewer parameters
  • 37% less computation
  • CPU 23% faster
  • GPU 13% faster
  • (not optimized!)
  • 0.2% pt. lower error
  • root-64
  • 40% fewer parameters
  • 45% less computation
  • CPU 31% faster
  • GPU 12% faster
  • 0.1% pt. higher error
SLIDE 43

ILSVRC12 Results – ResNet 200

  • root-64
  • 27% fewer parameters
  • 48% less computation
  • 0.2% pt. lower error
  • 0.14% lower error
SLIDE 44

Root Module

Structural Prior for CNNs with Sparse Inter-Filter Relationships

[Figure: the root module – a grouped spatial convolution (ReLU) followed by a full 1 × 1 convolution (ReLU), as above.]

ResNet-200 on ILSVRC: 48% fewer parameters, 27% less computation, identical (or very slightly higher) accuracy

SLIDE 45

Deep Roots

SLIDE 46

Xception

Google’s Xception architecture uses a form of root modules (#channels = #filter groups) - “Depthwise Separable Convolution”
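
A minimal PyTorch sketch of that extreme case (illustrative channel counts): a depthwise 3 × 3 convolution with groups equal to the number of channels, followed by a pointwise 1 × 1 convolution, compared against a standard 3 × 3 convolution.

```python
import torch.nn as nn

c_in, c_out = 128, 256

# Depthwise: one 3x3 filter per input channel (groups == channels),
# then pointwise: a full 1x1 conv mixing channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in),
    nn.Conv2d(c_in, c_out, kernel_size=1),
)
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise_separable))  # 128*9 + 128 + 128*256 + 256 = 34,304
print(count(standard))             # 128*256*9 + 256 = 295,168
```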

SLIDE 47

ResNeXt

Facebook’s ResNeXt architecture uses root modules, denoted “Aggregated Residual Transformations”

SLIDE 48

Conclusion

  • Structural priors are important for both generalization and efficiency
  • They are not simply replaced by strong regularization
  • They simplify the optimization of deep neural networks by constraining the search space/dimensionality
  • There is still a lot we don’t understand about the optimization of deep neural networks!

SLIDE 49

Research Directions

  • I. Automatically Discovering Structural Priors
  • II. Learning with “Natural” Datasets
  • III. Jointly Exploiting Random Exploration and Imitation
SLIDE 50

Research Directions

  • I. Automatically Discovering Structural Priors
  • Can we find methods of automatically discovering good structural priors from data?
  • Pruning does not improve generalization
  • Greedily growing networks leads to poor generalization
  • Results by Han et al.¹ show some promise: a pruning/growing cycle
  • Infer connectivity by analyzing inter-channel correlations in training data?

¹ Han, Song, Jeff Pool, John Tran, and William J. Dally (2015). “Learning both weights and connections for efficient neural networks”

SLIDE 51

Research Directions

  • II. Learning with “Natural” Datasets
  • Both ML and Computer Vision are dataset-driven fields
  • ImageNet, CIFAR and MNIST are class-balanced
  • Current solutions involve either throwing away data or fiddling with loss weighting
SLIDE 52

Research Directions

  • III. Exploiting both random exploration and imitation
  • RL is appealing - agents learn entirely from experience in an environment
  • For many problems this isn’t data-efficient enough or feasible:
  • e.g. learning to drive a car - randomly exploring in a real environment is dangerous and time-consuming
  • But we can easily collect data from a human driver for real-world driving
  • Use supervised learning to bootstrap RL
SLIDE 53

Questions

http://yani.io/annou yai20@cam.ac.uk