

  1. Learning Transferable Architectures for Scalable Image Recognition Zoph et al.

  2. Introduction • Architecture engineering • Finding a good architecture for a machine learning model • Neural Architecture Search (NAS) Framework • Automates architecture engineering using a reinforcement-learning search method • Problem: computationally expensive on large datasets like ImageNet • Approach of this paper • Search for an architecture on a smaller dataset like CIFAR-10 • Apply the learned architecture to a bigger dataset

  3. Datasets • CIFAR-10 • 60,000 32x32 RGB images across 10 classes • 50,000 train and 10,000 test images • 5,000 randomly selected images from the training set are used as a validation set • Images are whitened; 32x32 patches are randomly cropped from upsampled 40x40 images and random horizontal flips are applied • ImageNet • 14 million images • Resized to 299x299 or 331x331 resolution in this work
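
A minimal sketch of the CIFAR-10 augmentation described above, in NumPy. Zero-padding to 40x40 and per-channel standardization are assumptions standing in for the paper's exact upsampling and whitening.

```python
import numpy as np

def augment_cifar_image(img, rng, pad=4, crop=32):
    """Enlarge a 32x32x3 image to 40x40 (zero-padding assumed here), then
    take a random 32x32 crop and apply a random horizontal flip."""
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    patch = padded[top:top + crop, left:left + crop, :]
    if rng.random() < 0.5:                    # random horizontal flip
        patch = patch[:, ::-1, :]
    return patch

def whiten(batch):
    """Per-channel standardization as a stand-in for whitening."""
    mean = batch.mean(axis=(0, 1, 2), keepdims=True)
    std = batch.std(axis=(0, 1, 2), keepdims=True) + 1e-8
    return (batch - mean) / std

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3)).astype(np.float32)   # dummy batch
batch = whiten(np.stack([augment_cifar_image(im, rng) for im in images]))
```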

  4. NASNet Search Space • Allows the architecture to be transferred across datasets • The complexity of the architecture is decoupled from the depth of the network and the size of the input images • Hand-engineered CNN architectures contain repeated motifs • Convolutional cells with identical structure but different weights • Express these repeated motifs as cells and compose them to form the convolutional net

  5. Neural Architecture Search (NAS) Framework

  6. Convolutional Cells • Normal Cell • Returns a feature map of the same dimensions • Reduction Cell • Returns a feature map whose height and width are reduced by a factor of two

  7. Searching the Structures of the Cells • The controller repeats the following 5 steps B times, corresponding to the B blocks in a convolutional cell (see the sketch after slide 9) 1. Select a hidden state from h_i, h_{i−1}, or the set of hidden states created in previous blocks. 2. Select a second hidden state from the same options as in Step 1. 3. Select an operation to apply to the hidden state selected in Step 1. 4. Select an operation to apply to the hidden state selected in Step 2. 5. Select a method to combine the outputs of Steps 3 and 4 to create a new hidden state. • To predict both the Normal Cell and the Reduction Cell, the controller makes 2 × 5B predictions in total

  8. Searching the Structures of the Cells

  9. Searching the Structures of the Cells • Possible operations in Steps 3 and 4 • Identity • 1x3 then 3x1 convolution • 1x7 then 7x1 convolution • 3x3 dilated convolution • 3x3 average pooling • 3x3 max pooling • 5x5 max pooling • 7x7 max pooling • 1x1 convolution • 3x3 convolution • 3x3 depthwise-separable convolution • 5x5 depthwise-separable convolution • 7x7 depthwise-separable convolution
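
A minimal sketch of the 5-step block-sampling procedure from slide 7, using the operation list above. The real controller is an RNN trained with reinforcement learning; uniform random sampling stands in for its predictions here, so only the structure of the search space is illustrated. The combiner list and function names are illustrative assumptions.

```python
import random

OPS = [
    "identity", "1x3 then 3x1 conv", "1x7 then 7x1 conv",
    "3x3 dilated conv", "3x3 avg pool", "3x3 max pool",
    "5x5 max pool", "7x7 max pool", "1x1 conv", "3x3 conv",
    "3x3 depthwise-separable conv", "5x5 depthwise-separable conv",
    "7x7 depthwise-separable conv",
]
COMBINERS = ["add", "concat"]            # methods for Step 5 (assumed set)

def sample_cell(num_blocks=5, seed=None):
    """Sample one convolutional cell as a list of B blocks.
    Hidden states 0 and 1 stand for h_{i-1} and h_i; each new block
    appends one more hidden state that later blocks may select."""
    rng = random.Random(seed)
    hidden_states = [0, 1]               # h_{i-1} and h_i
    cell = []
    for _ in range(num_blocks):
        in1 = rng.choice(hidden_states)          # Step 1
        in2 = rng.choice(hidden_states)          # Step 2
        op1 = rng.choice(OPS)                    # Step 3
        op2 = rng.choice(OPS)                    # Step 4
        comb = rng.choice(COMBINERS)             # Step 5
        cell.append((in1, op1, in2, op2, comb))
        hidden_states.append(len(hidden_states)) # new hidden state
    return cell

# One architecture sample consists of a Normal and a Reduction cell:
normal_cell = sample_cell(seed=0)
reduction_cell = sample_cell(seed=1)
```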

  10. Top Performing Cells

  11. Building a network for a given task • Convolutional cells (Normal and Reduction) • Number of cell repeats N • Number of filters in the initial convolutional cell (see the sketch below)
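
A minimal sketch of how a network is assembled from the searched cells: N Normal Cells between Reduction Cells, with the filter count doubled after each reduction. The number of reduction stages and the example values are illustrative assumptions.

```python
def build_cell_layout(n_repeats, init_filters, n_reductions=2):
    """Return the sequence of (cell_type, num_filters) used to build the
    network: N Normal Cells, then a Reduction Cell that doubles the
    filter count, repeated n_reductions times."""
    layout, filters = [], init_filters
    for stage in range(n_reductions + 1):
        layout += [("normal", filters)] * n_repeats
        if stage < n_reductions:
            filters *= 2                          # double filters at each reduction
            layout.append(("reduction", filters))
    return layout

# e.g. N=6 repeats with 32 initial filters (illustrative values):
for cell_type, f in build_cell_layout(n_repeats=6, init_filters=32):
    print(cell_type, f)
```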

  12. ScheduledDropPath • DropPath • Each path in the cell is stochastically dropped with some fixed probability during training • Does not work well for NASNets • ScheduledDropPath • Each path in the cell is dropped with a probability that increases linearly over the course of training (see the sketch below) • Significantly improves the final performance
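
A minimal sketch of the ScheduledDropPath schedule, assuming a final drop probability and a simple per-path mask; the keep-at-least-one-path rule is an assumption for illustration.

```python
import numpy as np

def scheduled_drop_prob(step, total_steps, final_drop_prob=0.3):
    """Linearly increase the drop probability from 0 to final_drop_prob."""
    return final_drop_prob * min(step / total_steps, 1.0)

def drop_paths(path_outputs, step, total_steps, rng):
    """Zero out each path independently with the scheduled probability,
    keeping at least one path so the cell always produces an output."""
    p = scheduled_drop_prob(step, total_steps)
    keep = rng.random(len(path_outputs)) >= p
    if not keep.any():
        keep[rng.integers(len(path_outputs))] = True
    return [out * k for out, k in zip(path_outputs, keep)]

rng = np.random.default_rng(0)
paths = [np.ones((2, 2)) for _ in range(4)]        # dummy path outputs
survivors = drop_paths(paths, step=5_000, total_steps=100_000, rng=rng)
```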

  13. Result - CIFAR-10 Image Classification

  14. Result - ImageNet Image Classification

  15. Result

  16. Result - Object Detection

  17. Deep Speech: Scaling up end-to-end speech recognition Hannun et al.

  18. Introduction • Traditional speech systems • Human-engineered processing pipelines • Require a great amount of engineering effort • Perform poorly in noisy environments • Deep Speech • Applies deep learning end-to-end using an RNN • No need for human-designed components • Performs better on noisy speech than traditional speech systems

  19. RNN Training • Input • Training set X = {(x^(1), y^(1)), (x^(2), y^(2)), …}, where each x^(i) is a single utterance and y^(i) is its label • x^(i) is a time series of length T^(i) where every time slice is a vector of audio features • x^(i)_{t,p} denotes the power of the p'th frequency bin in the audio frame at time t • Output • A sequence of character probabilities for the transcription y, with ŷ_t = P(c_t | x), where c_t ∈ {a, b, c, …, z, space, apostrophe, blank}

  20. RNN Training • 5 layers of hidden units; the hidden units at layer l are denoted h^(l) • First three layers • Non-recurrent • h^(l)_t = g(W^(l) h^(l−1)_t + b^(l)) • g(z) = min{max{0, z}, 20} is the clipped rectified-linear (ReLU) activation function • W^(l) is the weight matrix and b^(l) is the bias vector • The first layer depends on the spectrogram frame x_t along with a context of C frames on each side • The other layers operate independently on the data for each time step
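
A minimal NumPy sketch of the clipped ReLU and the non-recurrent layers, including one way the first layer could consume a context of C frames on each side (zero-padding at the edges is an assumption). All sizes and initializations are illustrative.

```python
import numpy as np

def clipped_relu(z):
    """g(z) = min(max(0, z), 20), the clipped ReLU from the slide."""
    return np.minimum(np.maximum(0.0, z), 20.0)

def first_layer(x, W1, b1, C):
    """First layer: each output frame sees spectrogram frame x_t plus a
    context of C frames on each side (edges zero-padded here)."""
    T, p = x.shape
    padded = np.pad(x, ((C, C), (0, 0)))
    windows = np.stack([padded[t:t + 2 * C + 1].ravel() for t in range(T)])
    return clipped_relu(windows @ W1 + b1)          # shape (T, hidden)

def dense_layer(h_prev, W, b):
    """Layers 2 and 3: h^(l)_t = g(W^(l) h^(l-1)_t + b^(l)), per time step."""
    return clipped_relu(h_prev @ W + b)

rng = np.random.default_rng(0)
T, p, C, hidden = 50, 80, 5, 128                    # illustrative sizes
x = rng.standard_normal((T, p))
W1 = rng.standard_normal((p * (2 * C + 1), hidden)) * 0.01
h1 = first_layer(x, W1, np.zeros(hidden), C)
h2 = dense_layer(h1, rng.standard_normal((hidden, hidden)) * 0.01, np.zeros(hidden))
```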

  21. RNN Training • The fourth layer • Bi-directional recurrent layer • Two sets of hidden units • Forward recurrence • h^(f)_t = g(W^(4) h^(3)_t + W^(f)_r h^(f)_{t−1} + b^(4)) • Computed sequentially from t = 1 to t = T^(i) • Backward recurrence • h^(b)_t = g(W^(4) h^(3)_t + W^(b)_r h^(b)_{t+1} + b^(4)) • Computed sequentially in reverse from t = T^(i) to t = 1 • The fifth layer • h^(5)_t = g(W^(5) h^(4)_t + b^(5)), where h^(4)_t = h^(f)_t + h^(b)_t
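
A minimal NumPy sketch of the bi-directional fourth layer and the fifth layer, reusing the clipped ReLU g from slide 20; weight shapes and initialization are illustrative assumptions.

```python
import numpy as np

def g(z):                                   # clipped ReLU from slide 20
    return np.minimum(np.maximum(0.0, z), 20.0)

def bidirectional_layer(h3, W4, Wf_r, Wb_r, b4):
    """h^(f)_t runs forward over t, h^(b)_t runs backward over t; the
    fifth layer consumes their sum h^(4)_t = h^(f)_t + h^(b)_t."""
    T, d = h3.shape
    hf = np.zeros((T, d))
    hb = np.zeros((T, d))
    for t in range(T):                      # forward recurrence
        prev = hf[t - 1] if t > 0 else np.zeros(d)
        hf[t] = g(h3[t] @ W4 + prev @ Wf_r + b4)
    for t in reversed(range(T)):            # backward recurrence
        nxt = hb[t + 1] if t < T - 1 else np.zeros(d)
        hb[t] = g(h3[t] @ W4 + nxt @ Wb_r + b4)
    return hf + hb                          # h^(4)

rng = np.random.default_rng(0)
T, d = 30, 64
h3 = rng.standard_normal((T, d))
W4, Wf_r, Wb_r = (rng.standard_normal((d, d)) * 0.01 for _ in range(3))
h4 = bidirectional_layer(h3, W4, Wf_r, Wb_r, np.zeros(d))
h5 = g(h4 @ (rng.standard_normal((d, d)) * 0.01) + np.zeros(d))   # fifth layer
```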

  22. RNN Training • Output layer • A standard softmax function that yields the predicted character probabilities for each time slice t and character k in the alphabet • Only a single recurrent layer, the part that is hardest to parallelize • Does not use Long Short-Term Memory (LSTM) circuits • Approaches to avoid overfitting • Dropout on the feedforward layers at a rate between 5% and 10% • Jittering the input • Applying a language model to reduce error • Maximize Q(c) = log(P(c|x)) + α log(P_lm(c)) + β word_count(c)
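
A minimal sketch of the softmax output and the decoding objective Q(c). The acoustic and language-model log-probabilities are stubs supplied by the caller, and the α, β values are placeholders, not the paper's tuned settings.

```python
import numpy as np

def softmax(logits):
    """Per-time-step character probabilities ŷ_t = P(c_t | x)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def score_transcription(log_p_acoustic, transcription, log_p_lm,
                        alpha=2.0, beta=1.5):
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c).
    log_p_acoustic: network log-probability of the transcription;
    log_p_lm: language-model log-probability (both stubbed by the caller)."""
    word_count = len(transcription.split())
    return log_p_acoustic + alpha * log_p_lm + beta * word_count

# Pick the higher-scoring candidate from a beam of two hypotheses:
candidates = [("their going home", -12.0, -9.5),
              ("they're going home", -12.3, -7.1)]
best = max(candidates, key=lambda c: score_transcription(c[1], c[0], c[2]))
```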

  23. Optimizations • Data parallelism • Each GPU processes many examples in parallel • For large minibatches that a single GPU cannot support, each GPU processes a separate minibatch of examples and combines its computed gradient with its peers during each iteration • Model parallelism • Perform the computations of h^(f) and h^(b) in parallel • Problem: data transfers are time-consuming when computing the fifth layer • Use one GPU for each half of the time series: one computes h^(f) first, the other computes h^(b), and they exchange results at the midpoint • Striding • Shorten the recurrent layers by taking strides of size 2 in the original input
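
A minimal sketch of the striding optimization: keeping every second frame of the original input halves the sequence length the recurrent layer must unroll over. The midpoint split for model parallelism is only indicated in a comment; actual GPU placement is beyond this sketch.

```python
import numpy as np

def stride_input(x, stride=2):
    """Keep every `stride`-th spectrogram frame, halving (for stride=2)
    the number of time steps the recurrent layer processes."""
    return x[::stride]

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 80))       # (time steps, frequency bins)
x_strided = stride_input(x)               # shape (500, 80)

# Model parallelism (sketch only): split the sequence at the midpoint, so
# one GPU runs the forward recurrence on the first half while the other
# runs the backward recurrence on the second half, then the halves swap.
mid = len(x_strided) // 2
first_half, second_half = x_strided[:mid], x_strided[mid:]
```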

  24. Training Data • 5,000 hours of read speech from 9,600 speakers • Synthesize noisy training data • Using many short clips of noise sounds • Rejecting noise clips where the average power in each frequency band differs significantly from the average power of real noisy recordings • Lombard Effect • Speakers actively change the pitch or inflections of their voice to overcome noise around them • Captured by playing loud background noise through headphones worn by a person as they record an utterance
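
A minimal sketch of the noise-clip filter: compare a candidate clip's average power per frequency band against that of real noisy recordings and reject clips that differ too much. The FFT framing, band count, and dB threshold are assumptions.

```python
import numpy as np

def band_powers(signal, n_bands=40, frame=512):
    """Average power in each frequency band, via a magnitude FFT over frames."""
    n_frames = len(signal) // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame)
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(spectrum, n_bands, axis=1)
    return np.array([b.mean() for b in bands])

def accept_noise_clip(clip, reference_powers, max_db_diff=6.0):
    """Reject clips whose per-band average power deviates from the
    reference (real noisy recordings) by more than max_db_diff dB."""
    diff_db = 10 * np.abs(np.log10(band_powers(clip) / reference_powers))
    return bool(np.all(diff_db <= max_db_diff))

rng = np.random.default_rng(0)
reference = band_powers(rng.standard_normal(16000 * 5))   # stand-in reference
candidate = rng.standard_normal(16000 * 2)                # 2 s candidate clip
keep = accept_noise_clip(candidate, reference)
```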

  25. Performance

  26. Testing Noisy Speech Performance • Few standards exist • Evaluation set of 100 noisy and 100 noise-free utterances from 10 speakers • Noise environments • Background radio or TV; washing dishes in a sink; a crowded cafeteria; a restaurant; and inside a car driving in the rain • Utterance text • Primarily from web search queries and text messages • Signal-to-noise ratio between 2 and 6 dB

  27. Performance

  28. Thank you!
