SLIDE 1

Deep Learning Introduction

Christian Szegedy, Geoffrey Irving (Google Research)

SLIDE 2

Machine Learning

Supervised Learning Task

  • Assume:
    ○ Ground truth G
    ○ Model architecture f
    ○ Prediction metric σ
    ○ Training samples
  • Find model parameters m ∈ M such that the expected prediction error σ(f(x, m), G(x)) over the training samples is minimized.
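
As a concrete illustration, here is a minimal NumPy sketch of this objective, assuming (purely for illustration, not from the slides) a linear model f and a squared-error prediction metric σ:

```python
import numpy as np

# Illustrative choices (not from the slides): a linear model f(x, m) = m @ x
# and a squared-error prediction metric sigma.
def f(x, m):
    return m @ x

def sigma(prediction, truth):
    return float(np.sum((prediction - truth) ** 2))

def empirical_risk(m, samples):
    # Average the prediction metric over training samples (x, G(x)).
    return float(np.mean([sigma(f(x, m), g) for x, g in samples]))

# Toy usage: two 3-dimensional samples with scalar ground truth.
samples = [(np.ones(3), 2.0), (np.arange(3.0), 1.0)]
risk = empirical_risk(np.array([0.5, 0.5, 0.5]), samples)
```

Training then amounts to choosing m to make this empirical risk small.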

SLIDE 3

Machine Learning

Unsupervised Learning: a set of tasks that work on uncurated data, predicting properties that are inherently present in the data alone.

SLIDE 4

Machine Learning

Generative Learning Task

  • Input space with probability measure ℙ : Ω(D) ⟶ [0, 1]
  • Generative model architecture f : [0, 1]ⁿ ⨉ M ⟶ D

Find model parameters m ∈ M such that the distribution of f(S, m), for S sampled uniformly from [0, 1]ⁿ, approximates the data measure ℙ.

SLIDE 5

Machine Learning

Supervised Learning as Marginal Computation

  • Expanded input space with probability measure ℙ : Ω(D ⨉ P) ⟶ [0, 1]
  • Conditional generative model f : [0, 1]ⁿ ⨉ D ⨉ M ⟶ P

Find model parameters m ∈ M such that the conditional distribution of f(S, d, m) given d, for S sampled uniformly from [0, 1]ⁿ, approximates ℙ(· | d).

SLIDE 6

Deep versus Shallow Learning

[Diagram: Traditional machine learning: Data ⟶ Hand-crafted features ⟶ Predictor. Deep learning: Data ⟶ Learned features ⟶ Predictor.]

SLIDE 7

Deep versus Shallow Learning

[Diagram: Traditional machine learning: Data ⟶ Hand-crafted features ⟶ Predictor. Deep learning: Data ⟶ Learned features ⟶ Predictor.]

SLIDE 8

Deep versus Shallow Learning

[Diagram: Traditional machine learning: Data ⟶ Hand-crafted features ⟶ Predictor. Deep learning: Data ⟶ Learned features ⟶ Predictor.]

Traditional machine learning:
  • Mostly convex, provably tractable
  • Special-purpose solvers
  • Non-layered architectures

Deep learning:
  • Mostly NP-hard
  • General-purpose solvers
  • Hierarchical models

SLIDE 9

Provably Tractable Deep Learning Approaches

  • Sum-Product Networks [Hoifung Poon and Pedro Domingos]
    ○ Can learn generative models
    ○ Hierarchical structure
    ○ Automated learning of low-level features
    ○ Tractable training/inference under certain conditions
    ○ Practical implementations
  • Provable Bounds for Learning Some Deep Representations [Sanjeev Arora, Aditya Bhaskara, Rong Ge and Tengyu Ma]
    ○ Can learn generative models
    ○ Hierarchical structure
    ○ Automated learning of low-level features
    ○ Provably tractable for extremely sparse graphs
    ○ Creates deep and sparse artificial neural networks
    ○ Based on the polynomial-time solvable graph square root problem

SLIDE 10

Classical Feed-Forward Artificial Neural Networks

[Diagram: Input v ⟶ W₁x + b₁ ⟶ tanh(x) ⟶ W₂x + b₂ ⟶ tanh(x) ⟶ … ⟶ Wₙx + bₙ ⟶ tanh(x) ⟶ Loss (e.g. SVM). Each sample is a vector; the nonlinearity is applied element-wise. The objective is to minimize the total loss.]

Multilayer perceptron [Frank Rosenblatt, 1961]
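
A minimal NumPy sketch of the forward pass in the diagram above; the layer sizes and random weights are illustrative only:

```python
import numpy as np

def mlp_forward(v, weights, biases):
    """Forward pass of a multilayer perceptron: alternate affine maps
    W_i x + b_i with an element-wise tanh nonlinearity."""
    x = v
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

# Illustrative 3-layer network on a 4-dimensional input.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
output = mlp_forward(rng.standard_normal(4), weights, biases)
```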

SLIDE 11

Classical Feed-Forward Artificial Neural Networks

[Diagram: same feed-forward network as above.]

In today’s networks, tanh is increasingly replaced by max(x, 0)

(Rectified linear units or ReLUs)

SLIDE 12

Classical Feed-Forward Artificial Neural Networks

[Diagram: same feed-forward network as above.]

The loss is a huge sum, ranging over all training examples, and a highly nonlinear function of the parameters!

SLIDE 13

Optimizing the Neural Network Parameters

Minimize the total loss, summed over all training examples, as a function of the model parameters.

SLIDE 14

Optimizing the Neural Network Parameters

Minimize the total loss using gradient descent in the parameter space: repeatedly step the parameters in the direction opposite the gradient of the loss.

SLIDE 15

Stochastic Gradient Descent

Update the parameters using the gradient of the loss on a randomly sampled minibatch Bᵢ, scaled by the learning rate α.
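
A minimal sketch of one SGD step, assuming a `grad_fn` that returns the per-example gradient of the loss (the toy quadratic loss below is illustrative only):

```python
import numpy as np

def sgd_update(m, batch, grad_fn, alpha):
    """One stochastic gradient descent step:
    m <- m - alpha * (average gradient of the loss over minibatch B_i)."""
    g = np.mean([grad_fn(m, v) for v in batch], axis=0)
    return m - alpha * g

# Toy usage: per-example loss 0.5 * ||m - v||^2, whose gradient is (m - v).
grad_fn = lambda m, v: m - v
params = np.zeros(2)
batch = [np.array([1.0, 2.0]), np.array([3.0, 0.0])]
params = sgd_update(params, batch, grad_fn, alpha=0.1)
```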

SLIDE 16

Compute derivatives via chain rule

[Diagram: same feed-forward network as above.]

Backpropagation algorithm

Gradients are propagated recursively by a backward pass, using the forward-propagated function values and the local gradient of each function involved.

Rumelhart et al., 1986
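
A minimal NumPy sketch of backpropagation for a two-layer tanh network with a squared loss, illustrating the forward values and backward chain-rule passes described above (shapes and data are illustrative):

```python
import numpy as np

def backprop_two_layer(x, y, W1, b1, W2, b2):
    """Backpropagation for a tiny two-layer tanh network with squared loss:
    the forward pass stores intermediate values, the backward pass applies
    the chain rule layer by layer to obtain gradients for all parameters."""
    # Forward pass (values kept for the backward pass).
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)
    z2 = W2 @ h1 + b2
    loss = 0.5 * np.sum((z2 - y) ** 2)

    # Backward pass: multiply local gradients along the chain.
    dz2 = z2 - y                         # d loss / d z2
    dW2 = np.outer(dz2, h1)
    db2 = dz2
    dh1 = W2.T @ dz2
    dz1 = dh1 * (1.0 - h1 ** 2)          # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

# Illustrative shapes: 4-dimensional input, 5 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(2)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((2, 5)), np.zeros(2)
loss, grads = backprop_two_layer(x, y, W1, b1, W2, b2)
```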

SLIDE 17

Sketch of Deep Artificial Neural Network Training

  • Sample a batch Bᵢ of training examples
  • Maintain the network parameters M
  • Compute the network output N(v) for each training example v
  • Compute the loss(N(v)) of each of the predictions
  • Use backpropagation to compute the gradients g with respect to the model parameters
  • Update M by subtracting αg (a minimal sketch of this loop follows)
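
A minimal runnable sketch of this loop, using (purely for illustration) a linear "network" N(v) = v·m and a squared loss so that the gradient can be written explicitly:

```python
import numpy as np

def train(data, alpha=0.01, batch_size=4, steps=500, dim=3):
    """Training loop following the slide: sample a batch, compute outputs,
    compute the loss gradient by the chain rule, update the parameters."""
    rng = np.random.default_rng(0)
    m = np.zeros(dim)                               # network parameters M
    for _ in range(steps):
        idx = rng.choice(len(data), size=batch_size, replace=False)
        grad = np.zeros(dim)
        for i in idx:
            v, target = data[i]
            prediction = v @ m                      # N(v), a linear "network"
            grad += (prediction - target) * v       # gradient of 0.5*(N(v)-t)^2
        m -= alpha * grad / batch_size              # M <- M - alpha * g
    return m

# Toy data: targets generated by a known linear rule.
rng = np.random.default_rng(1)
xs = rng.standard_normal((32, 3))
data = [(x, x @ np.array([1.0, -2.0, 0.5])) for x in xs]
learned = train(data)
```
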
SLIDE 18

Real Life Deep Network Training

  • Data collection, preprocessing, and input encoding
  • Choosing a suitable framework that can do automatic differentiation
  • Designing a suitable network architecture
  • Using more sophisticated optimizers
  • Implementation optimization:
    ○ Hardware acceleration, esp. GPU
    ○ Distributed training using multiple model replicas
  • Choosing hyperparameters like learning rate and weights for auxiliary losses

SLIDE 19

Convolutional Networks

Spatial parameter sharing. Neocognitron [K. Fukushima, 1980]. Convolutional Neural Networks by Yann LeCun et al. (1988).

(Image credit: Yann LeCun)
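
A minimal sketch of spatial parameter sharing: one small filter is slid across the whole image, so the same weights are reused at every position (the filter and input below are illustrative, not the cited networks):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over the image; the same weights are applied at
    every spatial position (parameter sharing)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # illustrative 3x3 filter
feature_map = conv2d_valid(np.random.rand(28, 28), edge_filter)
```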

SLIDE 20

Deep versus Shallow Learning

[Diagram: Traditional machine learning: Data ⟶ Hand-crafted features ⟶ Predictor. Deep learning: Data ⟶ Learned features ⟶ Predictor.]

SLIDE 21

Low level features learned by vision networks

ImageNet Classification with Deep Convolutional Neural Networks [Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton 2012]

SLIDE 22

DeepDream visualization of internal feature representation

[Alexander Mordvintsev, Christopher Olah, Mike Tyka, 2015] Starting from a white-noise image, backpropagate the gradient from a trained network to the image pixels and try to maximize the response of various feature outputs.
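
A minimal sketch of that gradient-ascent-on-pixels procedure; `feature_grad` stands in for the gradient of a chosen feature response of a trained network, which is not available here, so a toy stand-in is used:

```python
import numpy as np

def deepdream_ascent(image, feature_grad, steps=100, step_size=0.01):
    """Gradient ascent on the input pixels: repeatedly nudge the image in
    the direction that increases a chosen feature response."""
    img = image.copy()
    for _ in range(steps):
        g = feature_grad(img)
        img += step_size * g / (np.abs(g).mean() + 1e-8)  # normalized step
    return img

# Toy stand-in: "feature response" = mean pixel value, so the gradient with
# respect to each pixel is a constant.  Start from white noise, as on the slide.
toy_grad = lambda img: np.ones_like(img) / img.size
dream = deepdream_ascent(np.random.rand(64, 64, 3), toy_grad, steps=10)
```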

SLIDE 23

Cambrian Explosion of Deep Vision Research

Zeiler-Fergus network (ILSVRC winner 2013); Inception-v1 (GoogLeNet), ILSVRC winner 2014.

SLIDE 24

[Chart: ImageNet classification progress, from Fisher vectors with hand-crafted features, to convolutional networks, to the Inception-v1 convolutional network, to residual convolutional networks, reaching better-than-human performance.]

Task: 1000 fine-grained classes, including the difference between "Eskimo dog" and "Siberian husky".

SLIDE 25

Eskimo dog Siberian husky

Example images from the ImageNet dataset (ImageNet Large Scale Visual Recognition Challenge, IJCV 2015 by Russakovsky et al.)

SLIDE 26

Object Detection

VOC benchmark: detecting objects of 20 different categories (persons, cars, cats, birds, potted plants, bottles, chairs, etc.)

State of the art:
  • Pre-deep-learning model, 2013 (Deformable Parts): 36% mAP
  • Deep-learning model, 2015: 78% mAP

SLIDE 27

Stylistic Transfer using Deep Neural Features

Source: Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artwork, nucl.ai Conference 2016 by Alex J. Champandard [2016] http://arxiv.org/pdf/1603.01768v1.pdf

SLIDE 28

Real Life Applications of Deep Vision Networks

  • Google Image and Photo Search (Inception-v2)
  • Face detection and tagging in Google Photos
  • PlaNet: identifying the location where an image was taken
  • StreetView privacy protection
  • Google Visual Translate
  • Nvidia’s DriveNet

All of the above applications use variants of the Inception network architecture.
SLIDE 29

Recurrent Neural Networks

Parameter sharing over time. LSTM: Long short-term memory [Sepp Hochreiter, Jürgen Schmidhuber, 1997]. (Image credit: Christopher Olah) http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
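
A minimal sketch of parameter sharing over time with a plain recurrent cell (an LSTM adds gating on top of this idea); shapes and weights are illustrative:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """The same weights (W_xh, W_hh, b_h) are reused at every time step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:                        # inputs: sequence of vectors
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return states

# Illustrative usage: a length-5 sequence of 3-dimensional inputs.
rng = np.random.default_rng(0)
inputs = [rng.standard_normal(3) for _ in range(5)]
W_xh, W_hh, b_h = rng.standard_normal((4, 3)), rng.standard_normal((4, 4)), np.zeros(4)
states = rnn_forward(inputs, W_xh, W_hh, b_h)
```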

SLIDE 30

Generative Models of Text

[Andrej Karpathy 2016]

SLIDE 31

Some Real life applications of recurrent networks

  • Voice transcription in phones [Siri, OK Google]
  • Video captioning in YouTube
  • Google Translate
  • House number transcription from StreetView to Google Maps

SLIDE 32

Open Source Deep Learning Frameworks

Torch

  • Lua API
  • Long history
  • GPU backend (via cuDNN)
  • Most control over dynamic execution
  • No support for distributed training

http://torch.ch

SLIDE 33

Open Source Deep Learning Frameworks

Theano

  • Python API
  • University of Montreal project
  • Fast GPU backend (via cuDNN)
  • Less control over dynamic execution than Torch
  • No support for distributed training

http://deeplearning.net/software/theano

SLIDE 34

Open Source Deep Learning Frameworks

TensorFlow

  • Python, C++ APIs
  • Used and maintained by Google
  • Fast GPU backend (via cuDNN)
  • Less control over dynamic execution than Torch
  • Support for distributed training now in open source

https://www.tensorflow.org

SLIDE 35

Deep learning for lemma selection

  • Collaboration between
    ○ Josef Urban’s group
    ○ Google Research
  • Input from the Mizar corpus:
    ○ Set of known lemmas
    ○ Proposition to prove
  • Pick a small subset of lemmas to give to E Prover
SLIDE 36

Deep learning for lemma selection

  • Simplified goal:
    ○ Rank lemmas by usefulness for a given conjecture
  • Embed the lemma into a vector using an LSTM
  • Embed the conjecture into a vector using a different LSTM
  • Combine the embeddings to estimate usefulness

[Diagram: conjecture ⟶ LSTM, lemma ⟶ LSTM; the two embeddings are combined by fully connected (FC) layers and a softmax.]
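
A minimal sketch of the combination step, with hypothetical `lemma_emb` and `conjecture_emb` vectors standing in for the two LSTM outputs; concatenation followed by a linear layer and a softmax is one plausible realization, not necessarily the exact model used:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def usefulness_scores(lemma_emb, conjecture_emb, W, b):
    """Combine the two embeddings and output a 2-way softmax
    (useful / not useful); the combination is illustrative only."""
    combined = np.concatenate([lemma_emb, conjecture_emb])
    return softmax(W @ combined + b)

# Hypothetical fixed-size embeddings standing in for the LSTM outputs.
lemma_emb = np.random.rand(8)
conjecture_emb = np.random.rand(8)
W, b = np.random.rand(2, 16), np.zeros(2)
scores = usefulness_scores(lemma_emb, conjecture_emb, W, b)
```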

SLIDE 37

Thank you!