Learning Transferable Architectures for Scalable Image Recognition - PowerPoint PPT Presentation
SLIDE 1

Learning Transferable Architectures for Scalable Image Recognition

Zoph et al.

SLIDE 2

Introduction

  • Architecture engineering
  • Finding a good architecture for a machine learning model by hand
  • Neural Architecture Search (NAS) framework
  • Automates architecture engineering using a reinforcement learning search

Method

  • Problem: searching is computationally expensive on large datasets like ImageNet
  • Approach of this paper
  • Search for an architecture on a smaller dataset like CIFAR-10
  • Apply the learned architecture to the bigger dataset
SLIDE 3

Datasets

  • CIFAR-10
  • 60,000 32x32 RGB images across 10 classes
  • 50,000 training and 10,000 test images
  • 5,000 randomly selected images from the training set are used as a validation set
  • Images are whitened, then 32x32 patches are randomly cropped from images upsampled to 40x40, with random horizontal flips applied
  • ImageNet
  • 14 million images
  • Resized to 299x299 or 331x331 resolution in this research
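The CIFAR-10 augmentation described above can be sketched in NumPy (a minimal sketch: zero-padding to 40x40 stands in for the upsampling step, whitening is omitted, and `augment` is an illustrative name):

```python
import numpy as np

def augment(img, rng):
    """Augment one 32x32x3 CIFAR-10 image: pad to 40x40,
    take a random 32x32 crop, and randomly flip horizontally."""
    padded = np.zeros((40, 40, 3), dtype=img.dtype)
    padded[4:36, 4:36] = img           # 4 pixels of padding on each side -> 40x40
    y, x = rng.integers(0, 9, size=2)  # crop origin anywhere in [0, 8]
    crop = padded[y:y + 32, x:x + 32]
    if rng.random() < 0.5:             # random horizontal flip
        crop = crop[:, ::-1]
    return crop

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3)).astype(np.float32)
out = augment(img, rng)
print(out.shape)  # (32, 32, 3)
```

Because the crop origin ranges over all 9x9 positions, the network sees slightly shifted versions of each image every epoch.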
SLIDE 4

NASNet Search Space

  • Allows the transferability of the architecture
  • The complexity of the architecture is decoupled from the depth of the network and the size of the input images
  • Repeated motifs appear in hand-engineered CNN architectures
  • Convolutional cells with identical structure but different weights
  • The search space expresses these repeated motifs
  • The cells are composed into full convolutional nets
SLIDE 5

Neural Architecture Search (NAS) Framework

SLIDE 6

Convolutional Cells

  • Normal Cell
  • Returns a feature map of the same dimension
  • Reduction Cell
  • Returns a feature map whose height and width are reduced by a factor of two

SLIDE 7

Searching the Structures of the Cells

  • The controller repeats the following 5 steps B times, corresponding to the B blocks in a convolutional cell:

1. Select a hidden state from hi, hi−1, or from the set of hidden states created in previous blocks.
2. Select a second hidden state from the same options as in Step 1.
3. Select an operation to apply to the hidden state selected in Step 1.
4. Select an operation to apply to the hidden state selected in Step 2.
5. Select a method to combine the outputs of Steps 3 and 4 to create a new hidden state.

  • To predict both a Normal Cell and a Reduction Cell, the controller makes 2 x 5B predictions in total
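The five controller steps can be sketched as follows (a toy sketch: uniform random choices stand in for the learned RNN controller, and the op and combiner names are an illustrative subset of the paper's list):

```python
import random

OPS = ["identity", "3x3 avg pool", "3x3 max pool",
       "3x3 sep conv", "5x5 sep conv", "7x7 sep conv"]
COMBINERS = ["add", "concat"]

def sample_cell(num_blocks, rng):
    """Sample one cell: repeat the 5 controller steps B times.
    Hidden states 0 and 1 stand for the cell inputs h_i and h_{i-1}."""
    hidden_states = [0, 1]
    blocks = []
    for _ in range(num_blocks):
        in1 = rng.choice(hidden_states)   # step 1: first input state
        in2 = rng.choice(hidden_states)   # step 2: second input state
        op1 = rng.choice(OPS)             # step 3: op for first input
        op2 = rng.choice(OPS)             # step 4: op for second input
        comb = rng.choice(COMBINERS)      # step 5: how to combine them
        blocks.append((in1, in2, op1, op2, comb))
        hidden_states.append(len(hidden_states))  # new state is selectable by later blocks
    return blocks

rng = random.Random(0)
cell = sample_cell(5, rng)
print(len(cell))  # 5 blocks, each a 5-tuple of choices
```

Because each new block's output joins the pool of selectable hidden states, the search space grows with the block index, which is why the controller makes exactly 5B sequential predictions per cell.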

SLIDE 8

Searching the Structures of the Cells

SLIDE 9

Searching the Structures of the Cells

  • Possible operations in step 3 and 4
  • Identity
  • 1x3 then 3x1 convolution
  • 1x7 then 7x1 convolution
  • 3x3 dilated convolution
  • 3x3 average pooling
  • 3x3 max pooling
  • 5x5 max pooling
  • 7x7 max pooling
  • 1x1 convolution
  • 3x3 convolution
  • 3x3 depthwise-separable conv
  • 5x5 depthwise-separable conv
  • 7x7 depthwise-separable conv
SLIDE 10

Top Performing Cells

SLIDE 11

Building a Network for a Certain Task

  • Convolutional cells
  • Number of cell repeats N
  • Number of filters in the initial convolutional cell
SLIDE 12

ScheduledDropPath

  • DropPath
  • Each path in the cell is stochastically dropped with some fixed probability during training
  • Does not work well for NASNets
  • ScheduledDropPath
  • Each path in the cell is dropped out with a probability that increases linearly over the course of training
  • Significantly improves the final performance
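The linear schedule can be sketched as (a minimal sketch; the final drop probability of 0.5 is an assumed illustration value, not taken from the paper):

```python
def drop_prob(step, total_steps, final_prob=0.5):
    """ScheduledDropPath: the per-path drop probability grows
    linearly from 0 to `final_prob` over the course of training."""
    return final_prob * min(step / total_steps, 1.0)

print(drop_prob(0, 1000))     # 0.0 at the start of training
print(drop_prob(500, 1000))   # 0.25 halfway through
print(drop_prob(1000, 1000))  # 0.5 at the end
```

Early in training the network trains with almost no regularization, and the pressure to build redundant paths ramps up gradually, which is the behavior plain DropPath's fixed probability lacks.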
SLIDE 13

Result - CIFAR-10 Image Classification

SLIDE 14

Result - ImageNet Image Classification

SLIDE 15

Result

SLIDE 16

Result - Object Detection

SLIDE 17

Deep Speech: Scaling up end-to-end speech recognition

Hannun et al.

SLIDE 18

Introduction

  • Traditional speech systems
  • Human-engineered processing pipelines
  • Require a great amount of effort to build
  • Weak in noisy environments
  • Deep Speech
  • Applies deep learning end-to-end using an RNN
  • No need for human-designed components
  • Performs better on noisy speech recognition than traditional speech systems

SLIDE 19

RNN Training

  • Input
  • Training set X = {(x(1), y(1)), (x(2), y(2)), . . .}, where each x(i) is a single utterance and y(i) is its label
  • x(i) is a time-series of length T(i) where every time-slice is a vector of audio features
  • x(i)_{t,p} denotes the power of the p'th frequency bin in the audio frame at time t
  • Output
  • A sequence of character probabilities for the transcription y, with ŷ_t = P(c_t | x), where c_t ∈ {a, b, c, . . . , z, space, apostrophe, blank}

SLIDE 20

RNN Training

  • 5 layers of hidden units; the hidden units at layer l are denoted h(l)
  • First three layers
  • Non-recurrent
  • h(l)_t = g(W(l) h(l−1)_t + b(l))
  • g(z) = min{max{0, z}, 20} is the clipped rectified-linear (ReLU) activation function
  • W(l) is the weight matrix and b(l) is the bias parameter of layer l
  • The first layer depends on the spectrogram frame x_t along with a context of C frames on each side
  • The other layers operate on independent data for each time step
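The non-recurrent layers above can be sketched in NumPy (a minimal sketch with toy dimensions; `feedforward_layer` is an illustrative name, and the first layer's context window of C frames is omitted):

```python
import numpy as np

def clipped_relu(z):
    """g(z) = min(max(0, z), 20), the clipped ReLU from the slides."""
    return np.minimum(np.maximum(0.0, z), 20.0)

def feedforward_layer(h_prev, W, b):
    """One non-recurrent layer: h(l)_t = g(W(l) h(l-1)_t + b(l)).
    Time steps are the columns of h_prev, so one matrix product
    applies the layer to every time step at once."""
    return clipped_relu(W @ h_prev + b)

rng = np.random.default_rng(0)
h_prev = rng.standard_normal((4, 10))   # 4 features x 10 time steps (toy sizes)
W = rng.standard_normal((8, 4))
b = rng.standard_normal((8, 1))
h = feedforward_layer(h_prev, W, b)
print(h.shape)  # (8, 10): 8 hidden units for each of the 10 time steps
```

Since each time step's output depends only on that step's input column, these layers parallelize trivially across time, unlike the recurrent layer that follows.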
SLIDE 21

RNN Training

  • The fourth layer
  • Bi-directional recurrent layer with two sets of hidden units
  • Forward recurrence: h(f)_t = g(W(4) h(3)_t + W(f)_r h(f)_{t−1} + b(4))
  • Computed sequentially from t = 1 to t = T(i)
  • Backward recurrence: h(b)_t = g(W(4) h(3)_t + W(b)_r h(b)_{t+1} + b(4))
  • Computed sequentially in reverse from t = T(i) to t = 1
  • The fifth layer
  • h(5)_t = g(W(5) h(4)_t + b(5)), where h(4)_t = h(f)_t + h(b)_t
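The two recurrences can be sketched in NumPy (a toy sketch with small random weights; both directions share the input weights W4 and bias b4, as in the equations, while Wf and Wb stand for the recurrent matrices W(f)_r and W(b)_r):

```python
import numpy as np

def clipped_relu(z):
    return np.minimum(np.maximum(0.0, z), 20.0)

def bidirectional_layer(h3, W4, Wf, Wb, b4):
    """Fourth layer: forward and backward recurrences over the
    layer-3 outputs h3 (shape: features x time), summed at the end."""
    d, T = W4.shape[0], h3.shape[1]
    hf = np.zeros((d, T))
    hb = np.zeros((d, T))
    for t in range(T):                      # forward: t = 1 .. T
        prev = hf[:, t - 1] if t > 0 else np.zeros(d)
        hf[:, t] = clipped_relu(W4 @ h3[:, t] + Wf @ prev + b4)
    for t in reversed(range(T)):            # backward: t = T .. 1
        nxt = hb[:, t + 1] if t < T - 1 else np.zeros(d)
        hb[:, t] = clipped_relu(W4 @ h3[:, t] + Wb @ nxt + b4)
    return hf + hb                          # h(4)_t = h(f)_t + h(b)_t

rng = np.random.default_rng(0)
h3 = rng.standard_normal((4, 6))
W4, Wf, Wb = (rng.standard_normal((4, 4)) * 0.1 for _ in range(3))
b4 = rng.standard_normal(4) * 0.1
h4 = bidirectional_layer(h3, W4, Wf, Wb, b4)
print(h4.shape)  # (4, 6)
```

The two loops are independent of each other, which is what makes the model-parallel split described later (one GPU per direction) possible.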

SLIDE 22

RNN Training

  • Output layer
  • A standard softmax function that yields the predicted character probabilities for each time slice t and character k in the alphabet
  • Only a single recurrent layer is used, since recurrent layers are the hardest to parallelize
  • Does not use Long Short-Term Memory (LSTM) circuits
  • Approaches to avoid overfitting
  • Dropout applied to the feedforward layers at a rate between 5% and 10%
  • Jittering the input
  • Applying a language model to reduce error
  • Maximize Q(c) = log(P(c|x)) + α log(Plm(c)) + β word_count(c)
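The decoding objective can be sketched as (α and β here are made-up illustration values; in practice they are tuned on held-out data, and the candidates c come from a beam search over the RNN outputs):

```python
def score(log_p_rnn, log_p_lm, n_words, alpha=2.0, beta=1.5):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c):
    the RNN's acoustic score, reweighted by the language model and a
    word-count bonus that offsets the LM's bias toward short outputs."""
    return log_p_rnn + alpha * log_p_lm + beta * n_words

# Toy comparison of two candidate transcriptions: the language-model
# term can promote a candidate the RNN alone ranks slightly lower.
q_a = score(log_p_rnn=-4.0, log_p_lm=-2.0, n_words=2)
q_b = score(log_p_rnn=-4.5, log_p_lm=-1.0, n_words=2)
print(q_a, q_b)  # -5.0 -3.5: q_b wins despite the lower RNN score
```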
SLIDE 23

Optimizations

  • Data parallelism
  • Each GPU processes many examples in parallel
  • Supports large minibatches that a single GPU cannot hold
  • Each GPU processes a separate minibatch of examples and combines its computed gradient with its peers during each iteration
  • Model parallelism
  • Performs the computations of h(f) and h(b) in parallel
  • Problem: data transfers are time-consuming when computing the fifth layer
  • Solution: one GPU for each half of the time-series
  • One GPU computes h(f) first while the other computes h(b); they exchange halves at the mid-point
  • Striding
  • Shortens the recurrent layers by taking strides of size 2 in the original input
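Striding can be sketched as (a minimal sketch; `stride_input` is an illustrative name):

```python
import numpy as np

def stride_input(x, step=2):
    """Keep every `step`-th time slice (the columns of x), so the
    recurrent layers see a sequence roughly half as long."""
    return x[:, ::step]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))   # 4 features x 10 time steps
print(stride_input(x).shape)       # (4, 5): half the time steps
```

Halving the sequence length halves the number of sequential recurrence steps, which is where the speedup comes from.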
SLIDE 24

Training Data

  • 5,000 hours of read speech from 9,600 speakers
  • Synthesized noisy training data
  • Superimposes many short clips of noise sound
  • Rejects noise clips where the average power in each frequency band differs significantly from the average power of real noisy recordings
  • Lombard Effect
  • Speakers actively change the pitch or inflections of their voice to overcome noise around them
  • Captured by playing loud background noise through headphones worn by a person as they record an utterance

SLIDE 25

Performance

SLIDE 26

Testing Noisy Speech Performance

  • Few standard benchmarks exist
  • Evaluation set of 100 noisy and 100 noise-free utterances from 10 speakers
  • Noise environments
  • Background radio or TV; washing dishes in a sink; a crowded cafeteria; a restaurant; and inside a car driving in the rain
  • Utterance text
  • Primarily from web search queries and text messages
  • Signal-to-noise ratio between 2 and 6 dB
SLIDE 27

Performance

SLIDE 28

Thank you!