SLIDE 1

Neural Networks

Hugo Larochelle (@hugo_larochelle), Google Brain

SLIDE 2

NEURAL NETWORKS

  • What we’ll cover
    • types of learning problems
      • definitions of popular learning problems
      • how to define an architecture for a learning problem
    • unintuitive properties of neural networks
      • adversarial examples
      • optimization landscape of neural networks

[Figure: a feed-forward neural network mapping an input x = (x1, …, xj, …, xd), through layers of hidden units with bias units, to an output f(x)]

SLIDE 3

Neural Networks

Types of learning problems

SLIDE 4

SUPERVISED LEARNING

Topics: supervised learning

  • Training time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y)
  • Test time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y)
  • Example
    • classification
    • regression
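To make the notation concrete, here is a minimal NumPy sketch of the supervised setting (the synthetic p(x, y) and the least-squares regressor are illustrative choices, not from the slides): training and test examples are i.i.d. draws from the same joint distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_p(n, d=5):
    """Draw n i.i.d. pairs from a joint p(x, y): here y is noisy-linear in x."""
    x = rng.normal(size=(n, d))
    w_true = np.arange(1.0, d + 1.0)
    y = x @ w_true + 0.1 * rng.normal(size=n)
    return x, y

# Training time: data {x(t), y(t)}, drawn i.i.d. from p(x, y).
x_train, y_train = sample_p(200)
w = np.linalg.lstsq(x_train, y_train, rcond=None)[0]   # regression example

# Test time: fresh draws from the *same* p(x, y).
x_test, y_test = sample_p(100)
print("test MSE:", np.mean((x_test @ w - y_test) ** 2))
```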

SLIDE 5

UNSUPERVISED LEARNING

Topics: unsupervised learning

  • Training time
    • data: {x(t)}
    • setting: x(t) ∼ p(x)
  • Test time
    • data: {x(t)}
    • setting: x(t) ∼ p(x)
  • Example
    • distribution estimation
    • dimensionality reduction

SLIDE 6

SEMI-SUPERVISED LEARNING

Topics: semi-supervised learning

  • Training time
    • data: {x(t), y(t)} and {x(t)}
    • setting: x(t), y(t) ∼ p(x, y) and x(t) ∼ p(x)
  • Test time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y)

SLIDE 7

MULTITASK LEARNING

Topics: multitask learning

  • Training time
    • data: {x(t), y1(t), …, yM(t)}
    • setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
  • Test time
    • data: {x(t), y1(t), …, yM(t)}
    • setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
  • Example
    • object recognition in images containing multiple objects
SLIDE 8

MULTITASK LEARNING

Topics: multitask learning

[Figure: a network computing shared hidden layers h(1)(x) (weights W(1)) and h(2)(x) (weights W(2)) from input x = (x1, …, xj, …, xd), topped by separate task-specific outputs y1, y2, y3]
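A minimal NumPy sketch of the architecture in the figure (layer sizes, tanh activations, and squared-error losses are illustrative assumptions): the hidden layers are shared, and each task only adds its own output head.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, M = 5, 16, 3    # input dim, shared hidden width, number of tasks

W1 = rng.normal(scale=0.1, size=(d, h))    # shared weights for h(1)(x)
W2 = rng.normal(scale=0.1, size=(h, h))    # shared weights for h(2)(x)
heads = [rng.normal(scale=0.1, size=h) for _ in range(M)]  # y1 ... yM

def forward(x):
    h1 = np.tanh(x @ W1)              # h(1)(x), shared by all tasks
    h2 = np.tanh(h1 @ W2)             # h(2)(x), shared by all tasks
    return [h2 @ v for v in heads]    # one task-specific output per head

x = rng.normal(size=d)
targets = rng.normal(size=M)          # one target per task
loss = sum((pred - t) ** 2 for pred, t in zip(forward(x), targets))
print("total multitask loss:", round(float(loss), 3))
```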

SLIDE 10

TRANSFER LEARNING

Topics: transfer learning

  • Training time
    • data: {x(t), y1(t), …, yM(t)}
    • setting: x(t), y1(t), …, yM(t) ∼ p(x, y1, …, yM)
  • Test time
    • data: {x(t), y1(t)}
    • setting: x(t), y1(t) ∼ p(x, y1)

SLIDE 11

STRUCTURED OUTPUT PREDICTION

Topics: structured output prediction

  • Training time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y), where y has arbitrary structure (vector, sequence, graph)
  • Test time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y)
  • Example
    • image caption generation
    • machine translation

SLIDE 12

DOMAIN ADAPTATION

Topics: domain adaptation, covariate shift

  • Training time
    • data: {x(t), y(t)} and {x̄(t′)}
    • setting: x(t) ∼ p(x), y(t) ∼ p(y|x(t)), x̄(t′) ∼ q(x), with q(x) ≈ p(x)
  • Test time
    • data: {x̄(t), y(t)}
    • setting: x̄(t) ∼ q(x), y(t) ∼ p(y|x̄(t))
  • Example
    • classify sentiment in reviews of different products
    • training on synthetic data but testing on real data (sim2real)

SLIDE 13

DOMAIN ADAPTATION

Topics: domain adaptation, covariate shift

[Figure: a network with input x and hidden representation h(x) (weights W, bias b), feeding both a label predictor f(x) (weights V, bias c) and a domain classifier on h(x) (weights w, bias d)]

  • Domain-adversarial networks (Ganin et al. 2015) train the hidden layer representation to be
    • 1. predictive of the target class
    • 2. indiscriminate of the domain
  • Trained by stochastic gradient descent: for each random pair x(t), x̄(t′)
    • 1. update W, V, b, c in the opposite direction of the gradient
    • 2. update w, d in the direction of the gradient
  • May also be used to promote fair and unbiased models …
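A minimal NumPy sketch of the adversarial update above (my own toy stand-in, not the paper's code: one tanh feature layer, a logistic domain classifier, and the label-prediction path V, c omitted; sizes, the target-domain shift, and the learning rate are made up). Relative to the domain classification loss, w and d take a descent step while W and b take an ascent step, which is the gradient reversal that makes h(x) indiscriminate of the domain.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, k, lr = 5, 8, 0.01

W = rng.normal(scale=0.1, size=(n_in, k))   # feature weights for h(x)
b = np.zeros(k)
w = rng.normal(scale=0.1, size=k)           # domain-classifier weights
d = 0.0                                     # domain-classifier bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(100):
    x_src = rng.normal(size=n_in)        # x(t)   ~ p(x), domain label 0
    x_tgt = rng.normal(size=n_in) + 1.0  # x̄(t′) ~ q(x), domain label 1
    for x, dom in ((x_src, 0.0), (x_tgt, 1.0)):
        h = np.tanh(x @ W + b)           # hidden representation h(x)
        p_dom = sigmoid(h @ w + d)       # predicted probability of domain 1
        g = p_dom - dom                  # dLoss/dscore for the logistic loss
        dh = g * w * (1.0 - h ** 2)      # backprop the same loss into h(x)
        # Update W, b in the ASCENT direction: hide the domain ...
        W += lr * np.outer(x, dh)
        b += lr * dh
        # ... and w, d in the descent direction: predict the domain.
        w -= lr * g * h
        d -= lr * g
```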

SLIDE 15

ONE-SHOT LEARNING

Topics: one-shot learning

  • Training time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {1, …, C}
  • Test time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {C + 1, …, C + M}
    • side information: a single labeled example from each of the M new classes
  • Example
    • recognizing a person based on a single picture of him/her

SLIDE 16

ONE-SHOT LEARNING

Topics: one-shot learning

[Figure: a Siamese architecture, two networks with tied weights W (layers of 500–2000 units down to a 30-dimensional code) mapping inputs Xa and Xb to codes ya and yb, compared through a distance D[ya, yb]; figure taken from Salakhutdinov and Hinton, 2007]
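At test time the learned code space turns one-shot recognition into nearest-neighbour matching. A sketch assuming a trained encoder (a fixed random projection stands in for the Siamese network here; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, code_dim, M = 64, 30, 5     # input dim, code dim, number of new classes

W_enc = rng.normal(scale=0.1, size=(d, code_dim))
def embed(x):                  # stand-in for the trained Siamese encoder
    return np.tanh(x @ W_enc)

# Side information: a single labeled example per new class C+1 ... C+M.
support = rng.normal(size=(M, d))
support_codes = embed(support)

def classify(x_query):
    dists = np.linalg.norm(support_codes - embed(x_query), axis=1)  # D[ya, yb]
    return int(np.argmin(dists))      # nearest class example wins

print("predicted new-class index:", classify(rng.normal(size=d)))
```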

SLIDE 17

ZERO-SHOT LEARNING

Topics: zero-shot learning, zero-data learning

  • Training time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {1, …, C}
    • side information: description vector zc for each of the C classes
  • Test time
    • data: {x(t), y(t)}
    • setting: x(t), y(t) ∼ p(x, y), subject to y(t) ∈ {C + 1, …, C + M}
    • side information: description vector zc for each of the M new classes
  • Example
    • recognizing an object based on a worded description of it

SLIDE 18

ZERO-SHOT LEARNING

Topics: zero-shot learning, zero-data learning

[Figure: an image is mapped by a CNN f to a 1×k embedding; the TF-IDF vector of each class's Wikipedia article (e.g. the article on the Cardinalidae bird family) is mapped by an MLP g to a C×k matrix of class embeddings; the 1×C class scores are dot products between the two. Ba, Swersky, Fidler, Salakhutdinov, arXiv 2015]
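The scoring rule in the figure reduces to a dot product between the image embedding and each class's text embedding. A minimal sketch with random stand-ins for the trained f and g:

```python
import numpy as np

rng = np.random.default_rng(0)
k, C = 32, 10     # joint embedding dim, number of classes

f_image = rng.normal(size=k)        # f(image): 1 x k image embedding
g_text = rng.normal(size=(C, k))    # g(article_c): C x k class embeddings

scores = g_text @ f_image           # 1 x C class scores via dot products
print("predicted class:", int(np.argmax(scores)))
```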

SLIDE 19

DESIGNING NEW ARCHITECTURES

Topics: designing new architectures

  • Tackling a new learning problem often requires designing an adapted neural architecture
  • Approach 1: use our intuition for how a human would reason about the problem
  • Approach 2: take an existing algorithm/procedure and turn it into a neural network

SLIDE 20

DESIGNING NEW ARCHITECTURES

Topics: designing new architectures

  • Many other examples
    • structured prediction by unrolling probabilistic inference in an MRF
    • planning by unrolling the value iteration algorithm (Tamar et al., NIPS 2016)
    • few-shot learning by unrolling gradient descent on the small training set (Ravi and Larochelle, ICLR 2017)

[Figure: a learning algorithm unrolled into a neural network]
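As a toy illustration of Approach 2 (my own example, not the architecture of either cited paper): K steps of gradient descent on a least-squares objective, unrolled into a fixed computation graph. Each step plays the role of a layer, and quantities like the step size could be treated as learnable parameters and trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 5, 10                      # problem size, number of unrolled steps

A = rng.normal(size=(n, n))
b = rng.normal(size=n)
alpha = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # safe step size

def unrolled_solver(A, b, K, alpha):
    """K unrolled gradient steps on ||Ax - b||^2: each step is a 'layer'."""
    x = np.zeros(len(b))
    for _ in range(K):
        grad = 2 * A.T @ (A @ x - b)  # gradient of the inner objective
        x = x - alpha * grad          # one unrolled update = one layer
    return x

x_hat = unrolled_solver(A, b, K, alpha)
print("residual:", np.linalg.norm(A @ x_hat - b))
```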

SLIDE 21

Neural Networks

Unintuitive properties of neural networks

SLIDE 22

THEY CAN MAKE DUMB ERRORS

Topics: adversarial examples

  • Intriguing Properties of Neural Networks
    Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, Fergus, ICLR 2014

[Figure: correctly classified images, the badly classified images obtained by perturbing them, and the difference between the two]

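These examples are cheap to construct once you can differentiate the loss with respect to the input. Below is a minimal sketch of the fast gradient sign method (from the follow-up paper by Goodfellow, Shlens and Szegedy, ICLR 2015, not the L-BFGS attack of the paper above); the logistic model is an illustrative stand-in for a deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, eps = 100, 0.1               # input dim, perturbation budget

w, b = rng.normal(size=n_in), 0.0  # stand-in "network": logistic regression
x = 0.05 * w + 0.1 * rng.normal(size=n_in)   # confidently classified input
y = 1.0                                      # its true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grad_x = (sigmoid(x @ w + b) - y) * w   # gradient of log-loss w.r.t. input x

# Fast gradient sign method: small step that increases the loss.
x_adv = x + eps * np.sign(grad_x)

print("p(y=1 | x)     =", round(float(sigmoid(x @ w + b)), 3))
print("p(y=1 | x_adv) =", round(float(sigmoid(x_adv @ w + b)), 3))
```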
SLIDE 23

THEY CAN MAKE DUMB ERRORS

Topics: adversarial examples

  • Humans have adversarial examples too
  • However, they don’t match those of neural networks
SLIDE 25

THEY ARE STRANGELY NON-CONVEX

Topics: non-convexity, saddle points

  • Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization
    Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014

[Figure: average loss as a function of the parameters θ]

SLIDE 28

THEY ARE STRANGELY NON-CONVEX

Topics: non-convexity, saddle points

  • Qualitatively Characterizing Neural Network Optimization Problems
    Goodfellow, Vinyals, Saxe, ICLR 2015

SLIDE 29

THEY ARE STRANGELY NON-CONVEX

Topics: non-convexity, saddle points

  • If a dataset is created by labeling points with a neural network of N hidden units
    • training another network with N hidden units on it is likely to fail
    • but training a larger neural network is more likely to work! (saddle points seem to be a blessing)

SLIDE 30

THEY WORK BEST WHEN BADLY TRAINED

Topics: sharp vs. flat minima

  • Flat Minima
    Hochreiter, Schmidhuber, Neural Computation 1997

[Figure: training function vs. testing function over θ, showing a flat minimum and a sharp minimum of the average loss]

SLIDE 31

THEY WORK BEST WHEN BADLY TRAINED

Topics: sharp vs. flat minima

  • On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
    Keskar, Mudigere, Nocedal, Smelyanskiy, Tang, ICLR 2017
    • found that training with large batch sizes tends to find sharper minima that generalize worse
    • this means we can’t talk about generalization without taking the training algorithm into account
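One way to see what "sharp" means (in the spirit of these papers, though not their exact sharpness measure): slice the loss along a random direction around a minimum and watch how fast it rises. A toy sketch with quadratic surfaces standing in for flat and sharp minima:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

def quadratic_loss(curvature):
    """Toy loss surface around a minimum at 0 with the given curvature."""
    return lambda theta: 0.5 * curvature * np.sum(theta ** 2)

flat_loss = quadratic_loss(0.1)    # flat minimum: loss rises slowly
sharp_loss = quadratic_loss(10.0)  # sharp minimum: loss rises quickly

# Probe the loss along a random unit direction around the minimum,
# as in the 1-D slices used to visualize these landscapes.
u = rng.normal(size=n)
u /= np.linalg.norm(u)
for name, loss in (("flat", flat_loss), ("sharp", sharp_loss)):
    print(name, [round(loss(a * u), 2) for a in np.linspace(-1.0, 1.0, 5)])
```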

SLIDE 32

THEY CAN EASILY MEMORIZE

Topics: model capacity vs. training algorithm

  • Understanding Deep Learning Requires Rethinking Generalization
    Zhang, Bengio, Hardt, Recht, Vinyals, ICLR 2017
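The headline experiment: big enough models fit even completely random labels. A toy analogue (my own stand-in using random features and least squares rather than a deep network): with more parameters than examples, training accuracy on random labels reaches 100% while test accuracy stays at chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 100, 10, 200     # examples, input dim, random features (p > n)

X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n) * 2.0 - 1.0   # completely random ±1 labels

F = rng.normal(size=(d, p))                  # fixed random feature map
Phi = np.tanh(X @ F)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]   # fit the random labels

X_te = rng.normal(size=(n, d))
y_te = rng.integers(0, 2, size=n) * 2.0 - 1.0
print("train acc:", np.mean(np.sign(Phi @ w) == y))                   # ~1.0
print("test acc :", np.mean(np.sign(np.tanh(X_te @ F) @ w) == y_te))  # ~0.5
```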

SLIDE 33

THEY CAN BE COMPRESSED

Topics: knowledge distillation

  • Distilling the Knowledge in a Neural Network
    Hinton, Vinyals, Dean, arXiv 2015

[Figure: a large, deep network alongside a much smaller network trained to reproduce the large network's output y]

SLIDE 36

THEY CAN BE COMPRESSED

Topics: knowledge distillation

  • Can successfully distill
    • a large neural network
    • an ensemble of neural networks
  • Works better than training the small network from scratch!
  • Do Deep Nets Really Need to be Deep?
    Jimmy Ba, Rich Caruana, NIPS 2014
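A minimal sketch of the distillation objective from the Hinton et al. paper: the student is trained on the teacher's temperature-softened output distribution (the temperature value and toy logits here are illustrative).

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer targets."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

teacher_logits = np.array([5.0, 2.0, 0.5])   # toy teacher outputs
student_logits = np.array([2.0, 1.5, 1.0])   # toy student outputs
T = 4.0

soft_targets = softmax(teacher_logits, T)    # teacher's softened distribution
student_probs = softmax(student_logits, T)

# Distillation loss: cross-entropy between softened teacher and student.
distill_loss = -np.sum(soft_targets * np.log(student_probs))
print("soft targets:", np.round(soft_targets, 3))
print("distillation loss:", round(float(distill_loss), 3))
```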

SLIDE 37

THEY ARE INFLUENCED BY INITIALIZATION

Topics: impact of initialization

  • Why Does Unsupervised Pre-Training Help Deep Learning?
    Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010

[Figure: 2-D visualization of the functions learned by 2-layer networks, without pre-training vs. with pre-training]

SLIDE 38

THEY ARE INFLUENCED BY FIRST EXAMPLES

Topics: impact of early examples

  • Why Does Unsupervised Pre-Training Help Deep Learning?
    Erhan, Bengio, Courville, Manzagol, Vincent, JMLR 2010

SLIDE 39

YET THEY FORGET WHAT THEY LEARNED

Topics: lifelong learning, continual learning

  • Overcoming Catastrophic Forgetting in Neural Networks
    Kirkpatrick et al., PNAS 2017
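The paper's method, elastic weight consolidation (EWC), slows forgetting by anchoring parameters that mattered for earlier tasks. A minimal sketch of the penalty (the diagonal Fisher values, λ, and the task-B loss are toy stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 10, 100.0

theta_A = rng.normal(size=n)    # parameters learned on task A
fisher = rng.uniform(size=n)    # diagonal Fisher: importance per parameter

def task_b_loss(theta):
    return np.sum((theta - 1.0) ** 2)     # toy loss for the new task B

def ewc_loss(theta):
    # Task-B loss plus a quadratic pull toward the task-A parameters,
    # weighted by how important each parameter was for task A.
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_A) ** 2)
    return task_b_loss(theta) + penalty

print("EWC loss at task-A params:", round(float(ewc_loss(theta_A)), 2))
```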

SLIDE 40

SO THERE IS A LOT MORE TO UNDERSTAND!!

SLIDE 41

MERCI!
