Structural Priors in Deep Neural Networks
YANI IOANNOU, MAR. 12TH 2018
About Me
- Yani Ioannou (yu-an-nu)
- Ph.D. Student, University of Cambridge
- Dept. of Engineering, Machine Intelligence Lab
- Prof. Roberto Cipolla, Dr. Antonio Criminisi
Past Work & Publications
- Unorganized point clouds of urban environments (published at 3DIMPVT)
- MICCAI-BRATS 2014
- Intellectual Ventures/Gates Foundation
- NIPS 2016
- CVPR 2016
- Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, Antonio Criminisi. ICLR 2016.
- Yani Ioannou, Duncan Robertson, Roberto Cipolla, Antonio Criminisi. CVPR 2017.
- Microsoft Research Tech. Report (2015)
State-of-the-art DNNs for ImageNet were only getting more computationally complex, growing in both depth and width. Do we really need all of this complexity to improve generalization?
[Plot: top-5 error vs. multiply-accumulate operations (log scale) for AlexNet, VGG-11/13/16/19, GoogLeNet (1×/10×/144×), MSRA-A/B/C, ResNet-50 and Pre-ResNet-200, comparing crop-&-mirror augmentation against extra augmentation.]
Much work on compressing DNNs shows that trained DNNs are over-parameterized.
Structural priors: incorporating our prior knowledge of the problem and its representation into the connective structure of a neural network. An over-parameterized network does not generalize as well as a network with a more appropriate parameterization; network architecture does matter for generalization.
Prior Knowledge for Natural Images:
Convolutional Neural Networks
Structural Prior for Natural Images
[Diagram: an H × W × c1 input image/feature map is convolved (∗) with c2 filters (parameters) of size h1 × w1 × c1 and passed through a ReLU, giving an H × W × c2 output.]
Convolutional Neural Networks
Structural Prior for Natural Images
Fully connected layer: kernel N/A; every output pixel is connected to every input pixel.
Convolutional layer: 3 × 3 square kernel; each output pixel is connected only to a local neighbourhood of input pixels.
[Figure: (a) an input image (zero-padded, 3 × 4 pixels) and the resulting output feature map (4 × 3), with pixels indexed 0–11; (b) the connection structure from input pixels to output pixels — dense for a fully connected layer, sparse and local for a convolutional layer.]
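To make the sparsity of the convolutional connection structure concrete, here is a back-of-the-envelope parameter-count comparison (the 32 × 32 image size is an illustrative assumption, not a figure from the talk):

```python
# Parameter count of a dense layer vs. a convolutional layer mapping a
# 32x32 single-channel image to a 32x32 single-channel output.
H, W = 32, 32          # spatial size of input and output
k = 3                  # 3x3 kernel

# Fully connected: every output pixel connects to every input pixel.
fc_params = (H * W) * (H * W)

# Convolutional: one shared 3x3 kernel, regardless of image size.
conv_params = k * k

print(fc_params)    # 1048576
print(conv_params)  # 9
```

The shared, local kernel is exactly the structural prior: weight sharing and local connectivity remove over a hundred thousand redundant parameters before training even begins.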
My thesis is based on three novel contributions, each exploring a separate aspect of structural priors in DNNs:
Prior Knowledge: Low-Rank Filters
Natural images are dominated by vertical/horizontal edges and relationships, suggesting low-rank filters. Trained filters can often be replaced with low-rank approximations, e.g. Jaderberg (2014). Does every filter need to be square in a CNN?
Approximated Low-Rank Filters
Jaderberg, Max, Andrea Vedaldi, and Andrew Zisserman (2014). "Speeding up Convolutional Neural Networks with Low Rank Expansions". BMVC 2014.
[Diagram: a full h × w convolution over an H × W × c1 input is approximated by a w × 1 convolution producing d intermediate channels, followed by a 1 × h convolution producing the c2 output channels.]
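A minimal sketch of the low-rank idea, using SVD to split a separable 3×3 filter into a 3×1 and a 1×3 component (the Sobel-like filter is an illustrative choice, and this is not Jaderberg et al.'s exact optimization scheme):

```python
import numpy as np

# A vertical-edge (Sobel-like) filter that happens to be exactly rank 1.
f = np.array([[1., 0., -1.],
              [2., 0., -2.],
              [1., 0., -1.]])

# Rank-1 factorization via SVD: f ~ v h, with v a 3x1 column and h a 1x3 row.
U, s, Vt = np.linalg.svd(f)
v = U[:, 0] * np.sqrt(s[0])   # 3x1 vertical component
h = Vt[0, :] * np.sqrt(s[0])  # 1x3 horizontal component

approx = np.outer(v, h)       # rank-1 reconstruction of the filter
err = np.abs(f - approx).max()
print(err)  # ~0: this filter is separable, so rank 1 is exact
```

Applying the 3×1 then the 1×3 filter costs 6 multiplies per pixel instead of 9, and the gap grows quadratically with kernel size.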
CNN with Low-Dimensional Embedding
Typical sub-architecture found in Network-in-Network, ResNet/Inception.
[Diagram: an H × W × c1 input is convolved (∗) with c2 filters of size h1 × w1 × c1 and passed through a ReLU, then convolved with c3 filters of size 1 × 1 × c2 and a second ReLU, giving an H × W × c3 output.]
Proposed: Low-Rank Basis
Same total number of filters on each layer as the original network, but 50% are 1×3 and 50% are 3×1.
[Diagram: the low-dimensional-embedding module with its c2 first-layer filters split between 1×3 and 3×1 shapes, followed by the 1 × 1 × c2 layer of c3 filters.]
Proposed Structural Prior: Low-Rank + Full Basis
25% of the total filters are full 3×3; the remainder are 1×3 and 3×1.
[Diagram: as above, with a mix of 1×3, 3×1 and full 3×3 filters in the first layer.]
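A rough sketch of the parameter savings from the low-rank and mixed bases (channel counts and the exact split of the mixed basis are illustrative assumptions, not the thesis's configurations):

```python
# Parameter cost of a conv layer in which a fraction of the c2 filters
# are 1x3, a fraction are 3x1, and the rest are full 3x3.
def layer_params(c1, c2, frac_1x3, frac_3x1):
    n_1x3 = int(c2 * frac_1x3)
    n_3x1 = int(c2 * frac_3x1)
    n_3x3 = c2 - n_1x3 - n_3x1
    return c1 * (3 * n_1x3 + 3 * n_3x1 + 9 * n_3x3)

c1 = c2 = 256                                   # hypothetical channel counts
full     = layer_params(c1, c2, 0.0, 0.0)       # all filters 3x3
low_rank = layer_params(c1, c2, 0.5, 0.5)       # 50% 1x3, 50% 3x1
mixed    = layer_params(c1, c2, 0.375, 0.375)   # 25% kept full 3x3

print(low_rank / full)  # 1/3 of the parameters
print(mixed / full)     # 1/2 of the parameters
```

The all-low-rank layer keeps only a third of the parameters; keeping a quarter of the filters full-rank still halves the cost while retaining some non-separable capacity.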
Inception
Learning a filter-size basis: many small filters (1×1, 3×3) and fewer of the larger ones (5×5, 7×7), alongside max pooling.
[Diagram: an Inception-style module whose first-layer filters span sizes 1×1, 3×3, 5×5 and 7×7 over c1 input channels, followed by 1 × 1 × c2 convolutions producing c3 channels.]
Low-Rank Basis
Structural Prior for CNNs
[Diagram: the low-rank basis module, as above.]
VGG-11, ILSVRC:
- 21% fewer parameters, 41% less computation (low-rank only)
- 1% pt higher accuracy, 16% less computation (low/full-rank mix)
Rethinking the Inception Architecture for Computer Vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. arXiv:1512.00567 [cs.CV], 2015.
[Figure 6 of Szegedy et al.: Inception modules after the factorization of n × n convolutions into 1×n and n×1 convolutions.]
Inception v.3
Google’s Inception architecture (v.3 and higher) uses our low-rank filters!
Prior Knowledge: Filter Groups
Does every filter need to be connected to every filter in the previous layer of a CNN?
AlexNet Filter Groups
[Figure: conv1 and conv2 filters of a Network-in-Network model, and their conv1/conv2 filter covariance.]
Convolutional filters in layers with g filter groups operate on only 1/g of the input channels.
[Diagram: an H × W × c1 input is convolved (∗) with c2 filters of size h1 × w1 × c1/g, arranged as g groups of c2/g filters, followed by a ReLU, giving an H × W × c2 output.]
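The parameter saving from filter groups can be sketched directly (the helper function is hypothetical; the example dimensions match AlexNet's 5 × 5 conv2 layer, which used g = 2):

```python
# Parameters of a conv layer with g filter groups: each of the c2 filters
# sees only c1/g of the input channels.
def grouped_conv_params(c1, c2, h, w, g):
    assert c1 % g == 0 and c2 % g == 0
    return c2 * (c1 // g) * h * w

c1, c2, h, w = 96, 256, 5, 5
print(grouped_conv_params(c1, c2, h, w, 1))  # 614400 (ungrouped)
print(grouped_conv_params(c1, c2, h, w, 2))  # 307200 (AlexNet-style, g = 2)
```

Parameters (and multiply-accumulates) in the grouped layer fall by exactly a factor of g; the open question is how much cross-channel mixing a network actually needs.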
CNN with Low-Dimensional Embedding (recap)
Typical sub-architecture found in Network-in-Network, ResNet/Inception.
Root-2 Module
Structural Prior for CNNs with Sparse Inter-Filter Relationships
[Diagram: the first layer's c2 filters, of size h1 × w1 × c1/2, are split into g = 2 filter groups, each operating on half of the input channels; a ReLU, a full 1 × 1 × c2 convolution with c3 filters and a second ReLU follow.]
Root-4 Module
Structural Prior for CNNs with Sparse Inter-Filter Relationships
[Diagram: as the root-2 module, but with g = 4 filter groups in the first layer, each operating on a quarter of the input channels.]
Three ways to arrange filter groups over depth:
- Tree: increase the number of filter groups with depth
- Root: decrease the number of filter groups with depth
- Column: maintain a constant number of filter groups
[Diagram: the three topologies, each growing upward from the input.]
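The three schedules can be sketched as simple group-count sequences per layer (the layer count and base group count are illustrative assumptions):

```python
# Group count per layer for the three topologies, over a 4-layer network.
def schedule(kind, depth=4, g0=8):
    if kind == "tree":
        # doubling the number of filter groups with depth
        return [min(g0, 2 ** d) for d in range(depth)]
    if kind == "root":
        # halving the number of filter groups with depth
        return [max(1, g0 // 2 ** d) for d in range(depth)]
    if kind == "column":
        # constant number of filter groups
        return [g0] * depth
    raise ValueError(kind)

print(schedule("tree"))    # [1, 2, 4, 8]
print(schedule("root"))    # [8, 4, 2, 1]
print(schedule("column"))  # [8, 8, 8, 8]
```

A root topology keeps early layers sparsely connected (local, filter-wise features) while allowing later layers full cross-channel mixing.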
CIFAR-10 Results
ILSVRC12 Results – ResNet 50
ILSVRC12 Results – ResNet 200
Root Module
Structural Prior for CNNs with Sparse Inter-Filter Relationships
[Diagram: the general root module: g filter groups in the first layer, each operating on c1/g input channels, followed by a ReLU, a full 1 × 1 × c2 convolution with c3 filters and a second ReLU.]
ResNet-200, ILSVRC: 48% fewer parameters, 27% less computation, with identical (or very slightly higher) accuracy.
Deep Roots
Xception
Google's Xception architecture uses a form of root module (#channels = #filter groups), known as "depthwise separable convolution".
ResNeXt
Facebook's ResNeXt architecture uses root modules, denoted "aggregated residual transformations".
Structural priors reduce the search space/dimensionality of learning in neural networks! Can we learn structural priors from the data?
Han, Song, Jeff Pool, John Tran, and William J. Dally (2015). "Learning both weights and connections for efficient neural networks". NIPS 2015. Such iterative pruning and retraining is, however, time-consuming.
http://yani.io/annou yai20@cam.ac.uk