SLIDE 1

Neural networks and Reinforcement learning review

CS 540 Yingyu Liang

SLIDE 2

Neural Networks

SLIDE 3

Outline

  • Building unit: neuron
  • Linear perceptron
  • Non-linear perceptron
  • The power/limit of a single perceptron
  • Learning of a single perceptron
  • Neural network: a network of neurons
  • Layers, hidden units
  • Learning of neural network: backpropagation (gradient descent)
SLIDE 4

Linear perceptron

  • Input: $x_1, x_2, \ldots, x_D$ (for notational simplicity, define $x_0 = 1$)
  • Weights: $w_1, w_2, \ldots, w_D$
  • Bias: $w_0$
  • Output: $a = \sum_{d=0}^{D} w_d x_d$
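Below is a minimal NumPy sketch of this weighted sum (not from the slides; the function name and example values are mine):

```python
import numpy as np

def linear_perceptron(x, w):
    """Output a = sum_{d=0}^{D} w_d * x_d, with x_0 = 1 prepended so w_0 acts as the bias."""
    x = np.concatenate(([1.0], x))
    return np.dot(w, x)

# D = 2 inputs, weights [w_0, w_1, w_2]
print(linear_perceptron(np.array([2.0, 3.0]), np.array([0.5, 1.0, -1.0])))  # 0.5 + 2.0 - 3.0 = -0.5
```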

SLIDE 5

Nonlinear perceptron

  • Input: $x_1, x_2, \ldots, x_D$ (for notational simplicity, define $x_0 = 1$)
  • Weights: $w_1, w_2, \ldots, w_D$
  • Bias: $w_0$
  • Activation function: $g(z) = \text{step}(z)$, $\text{sigmoid}(z)$, $\text{relu}(z)$, …
  • Output: $a = g\left(\sum_{d=0}^{D} w_d x_d\right)$
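The same sketch with an activation applied to the sum; the three activations match the ones listed above (names are my choice):

```python
import numpy as np

# Common activation functions g(z)
step    = lambda z: np.where(z >= 0, 1.0, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: np.maximum(0.0, z)

def perceptron(x, w, g):
    """Output a = g(sum_{d=0}^{D} w_d * x_d), with x_0 = 1 for the bias."""
    z = np.dot(w, np.concatenate(([1.0], x)))
    return g(z)

print(perceptron(np.array([2.0, 3.0]), np.array([0.5, 1.0, -1.0]), sigmoid))
```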

SLIDE 6

Example Question

  • Will you go to the festival? Inputs: Weather, Company, Proximity.
  • Go only if Weather is favorable and at least one of the other two conditions is favorable.
  • All inputs are binary; 1 is favorable. One perceptron that implements this rule is sketched below.
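One set of weights that realizes this rule with a step activation (my own choice; the slide does not give weights): make Weather worth 2 and the others worth 1, with bias $-2.5$, so the unit fires only when Weather = 1 and at least one other input is 1:

```python
step = lambda z: 1 if z >= 0 else 0

def go_to_festival(weather, company, proximity):
    # 2*weather + company + proximity - 2.5 >= 0  iff  weather = 1 and (company or proximity)
    return step(2 * weather + 1 * company + 1 * proximity - 2.5)

for inputs in [(1, 1, 0), (1, 0, 1), (1, 0, 0), (0, 1, 1)]:
    print(inputs, "->", go_to_festival(*inputs))  # 1, 1, 0, 0
```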

SLIDE 7

Multi-layer neural networks

  • Training: encode a label $y$ by an indicator vector

β–ͺ class1 = (1,0,0,…,0), class2 = (0,1,0,…,0), etc.

  • Test: choose the class corresponding to the largest output unit

(Diagram: inputs $x_1, x_2$ feed three hidden units in layer (2), which feed $K$ output units in layer (3).)

Hidden units, with weight $w_{kd}^{(2)}$ from input $x_d$ to hidden unit $k$:

$$a_k^{(2)} = g\Big(\sum_d x_d\, w_{kd}^{(2)}\Big), \quad k = 1, 2, 3$$

Output units, with weight $w_{kj}^{(3)}$ from hidden unit $j$ to output unit $k$:

$$a_k = g\Big(\sum_j a_j^{(2)}\, w_{kj}^{(3)}\Big), \quad k = 1, \ldots, K$$
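A forward-pass sketch for this 2-input, 3-hidden-unit, $K$-output architecture (sigmoid activation, random weights, and the omission of bias terms are my simplifications):

```python
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

def forward(x, W2, W3):
    """x: (2,) inputs; W2: (3, 2) hidden-layer weights; W3: (K, 3) output-layer weights."""
    a2 = g(W2 @ x)     # a_j^(2) = g(sum_d x_d w_jd^(2))
    return g(W3 @ a2)  # a_k     = g(sum_j a_j^(2) w_kj^(3))

rng = np.random.default_rng(0)
a = forward(rng.normal(size=2), rng.normal(size=(3, 2)), rng.normal(size=(4, 3)))  # K = 4
print("predicted class:", np.argmax(a))  # test rule: pick the largest output unit
```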

SLIDE 8

Learning in neural network

  • Again we will minimize the error ($K$ outputs):
  • $x$: one training point in the training set $D$
  • $a_c$: the $c$-th output for the training point $x$
  • $y_c$: the $c$-th element of the label indicator vector for $x$

$$E = \frac{1}{2} \sum_{x \in D} E_x, \qquad E_x = \|y - a\|^2 = \sum_{c=1}^{K} (a_c - y_c)^2$$
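A quick numeric check of $E_x$ for a single training point (the values are arbitrary):

```python
import numpy as np

def example_error(a, y):
    """E_x = ||y - a||^2 = sum_c (a_c - y_c)^2 for one training point."""
    return np.sum((a - y) ** 2)

a = np.array([0.8, 0.1, 0.1])  # network outputs, K = 3
y = np.array([1.0, 0.0, 0.0])  # indicator vector for class 1
print(example_error(a, y))     # 0.2^2 + 0.1^2 + 0.1^2 = 0.06
```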

SLIDE 9

Backpropagation

With $E_x = \|y - a\|^2$ on a network with layers (1)–(4), consider the weight $w_{11}^{(4)}$ into output unit 1, with pre-activation $z_1^{(4)}$ and incoming activation $a_1^{(3)}$ from layer (3):

$$\frac{\partial E_x}{\partial w_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}$$

By the chain rule:

$$\delta_1^{(4)} = \frac{\partial E_x}{\partial z_1^{(4)}} = 2(a_1 - y_1)\, g'\big(z_1^{(4)}\big)$$
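A sketch of this output-layer gradient for a single weight, using a sigmoid for $g$ (the activation choice and the numbers are my assumptions):

```python
import numpy as np

g       = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: g(z) * (1.0 - g(z))  # derivative of the sigmoid

a3 = np.array([0.6, 0.3])   # activations a_1^(3), a_2^(3) from layer (3)
w  = np.array([0.5, -0.4])  # weights w_11^(4), w_12^(4)
y1 = 1.0                    # target for output unit 1

z1 = w @ a3                              # pre-activation z_1^(4)
delta1 = 2 * (g(z1) - y1) * g_prime(z1)  # delta_1^(4) = dE_x/dz_1^(4)
print("dE_x/dw_11^(4) =", delta1 * a3[0])
```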

SLIDE 10

πœ€2

(4)

πœ€1

(3)

πœ€2

(3)

πœ€1

(2)

πœ€2

(2)

πœ€1

(4)

Backpropagation of πœ€

𝑦2 𝑦1 = 𝑧 βˆ’ 𝑏 2 𝑏1 𝑏2

Layer (4) Layer (3) Layer (2) Layer (1)

𝐹𝑦

Thus, for any neuron in the network: πœ€

π‘˜ (π‘š) = ෍ 𝑙

πœ€π‘™

π‘š+1 π‘₯π‘™π‘˜ π‘š+1

𝑕′ 𝑨

π‘˜ π‘š

πœ€

π‘˜ (π‘š)

: πœ€ of π‘˜π‘’β„Ž Neuron in Layer π‘š πœ€π‘™

(π‘š+1)

: πœ€ of π‘™π‘’β„Ž Neuron in Layer π‘š + 1 𝑕′ 𝑨

π‘˜ π‘š

: derivative of π‘˜π‘’β„Ž Neuron in Layer π‘š w.r.t. its linear combination input π‘₯π‘™π‘˜

(π‘š+1)

: Weight from π‘˜π‘’β„Ž Neuron in Layer π‘š to π‘™π‘’β„Ž Neuron in Layer π‘š + 1
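A minimal sketch of this backward recursion for a fully connected network (sigmoid activations and all variable names are my assumptions):

```python
import numpy as np

g       = lambda z: 1.0 / (1.0 + np.exp(-z))
g_prime = lambda z: g(z) * (1.0 - g(z))

def backprop_deltas(zs, a_out, y, Ws):
    """zs: pre-activations per layer, input side first; Ws[i]: weights into layer zs[i+1].
    Returns the delta of every layer, ordered input side first."""
    deltas = [2 * (a_out - y) * g_prime(zs[-1])]        # output layer: 2(a - y) g'(z)
    for z, W in zip(reversed(zs[:-1]), reversed(Ws)):
        deltas.append((W.T @ deltas[-1]) * g_prime(z))  # delta^(l) = (W^(l+1).T delta^(l+1)) g'(z^(l))
    return deltas[::-1]

# Tiny example: 2 inputs -> 2 hidden -> 1 output
rng = np.random.default_rng(0)
W3, W4 = rng.normal(size=(2, 2)), rng.normal(size=(1, 2))
x = rng.normal(size=2)
z3 = W3 @ x
z4 = W4 @ g(z3)
print(backprop_deltas([z3, z4], g(z4), np.array([1.0]), [W4]))
```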

SLIDE 11

Example Question

SLIDE 12

Example Question

SLIDE 13

Example Question

SLIDE 14

Example Question

SLIDE 15

Convolution: discrete version

  • Given arrays $u_t$ and $w_t$, their convolution is a function $s_t$:

$$s_t = \sum_{a=-\infty}^{+\infty} u_a\, w_{t-a}$$

  • Written as $s = u * w$ or $s_t = (u * w)_t$
  • When $u_t$ or $w_t$ is not defined, it is assumed to be 0
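A direct implementation of this sum, checked against NumPy's built-in (a sketch; the array values are arbitrary):

```python
import numpy as np

def conv(u, w):
    """s_t = sum_a u_a * w_{t-a}, with out-of-range entries treated as 0."""
    s = np.zeros(len(u) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(u)):
            if 0 <= t - a < len(w):
                s[t] += u[a] * w[t - a]
    return s

u, w = np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 0.5])
print(conv(u, w))         # [0.  1.  2.5 4.  1.5]
print(np.convolve(u, w))  # same result
```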

SLIDE 16

Convolution illustration

(Illustration: $u = [a, b, c, d, e, f]$ and $w = [z, y, x]$; sliding the flipped kernel $[x, y, z]$ over the window $(b, c, d)$ gives the output element $s_3 = xb + yc + zd$.)

SLIDE 17

Pooling illustration

(Illustration: $u = [a, b, c, d, e, f]$; a max-pooling window over $(b, c, d)$ outputs $\max(b, c, d)$.)
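A one-dimensional max-pooling sketch (the window width and stride are my own example parameters):

```python
import numpy as np

def max_pool_1d(u, width=3, stride=1):
    """Slide a window along u, emitting the max of each full window."""
    return np.array([u[i:i + width].max() for i in range(0, len(u) - width + 1, stride)])

u = np.array([1.0, 5.0, 2.0, 4.0, 3.0, 6.0])
print(max_pool_1d(u))  # [5. 5. 4. 6.]
```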

SLIDE 18

Example question

  • $w = [-1, 1, 1]$, $u = [1, 2, 3, 4, 5, 6]$
  • What is the value of $s = u * w$? (Valid padding)
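One way to check the answer with NumPy: np.convolve already performs the kernel flip from the definition, and mode="valid" keeps only the positions where the kernel fully overlaps the array:

```python
import numpy as np

u = np.array([1, 2, 3, 4, 5, 6])
w = np.array([-1, 1, 1])
print(np.convolve(u, w, mode="valid"))  # [0 1 2 3]
```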

SLIDE 19

Reinforcement Learning

SLIDE 20

Outline

  • The reinforcement learning task
  • Markov decision process
  • Value functions
  • Value iteration
  • Q functions
  • Q learning
SLIDE 21

Reinforcement learning as a Markov decision process (MDP)

(Diagram: at each step the agent observes state $s_t$ and reward $r_t$ from the environment and chooses action $a_t$: $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$)

  • Markov assumption: $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$
  • Also assume the reward is Markovian: $P(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(r_{t+1} \mid s_t, a_t)$

Goal: learn a policy $\pi : S \to A$ for choosing actions that maximizes

$$E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots], \quad \text{where } 0 \le \gamma < 1,$$

for every possible starting state $s_0$.
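A small sample-based sketch of the discounted sum being maximized (the reward sequence and $\gamma$ are arbitrary):

```python
def discounted_return(rewards, gamma=0.9):
    """r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one observed trajectory."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```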

SLIDE 22

Value function for a policy

  • Given a policy $\pi : S \to A$, define

$$V^{\pi}(s) \equiv E\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big]$$

assuming the action sequence is chosen according to $\pi$, starting at state $s$

  • We want the optimal policy $\pi^*$ where

$$\pi^* = \arg\max_{\pi} V^{\pi}(s) \ \text{for all } s$$

We'll denote the value function for this optimal policy as $V^*(s)$.
SLIDE 23

Value iteration for learning V*(s)

initialize $V(s)$ arbitrarily
loop until policy good enough {
    loop for $s \in S$ {
        loop for $a \in A$ {
            $Q(s, a) \leftarrow r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s')$
        }
        $V(s) \leftarrow \max_a Q(s, a)$
    }
}
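A runnable sketch of this loop on a toy two-state MDP (the transition and reward tables are invented for illustration; a fixed iteration count stands in for "until policy good enough"):

```python
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},  # P[s][a] = [(s', P(s'|s,a)), ...]
     1: {0: [(0, 1.0)],           1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0},                          # R[s][a] = r(s, a)
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9

V = {s: 0.0 for s in P}                            # initialize V(s) arbitrarily
for _ in range(100):                               # loop until policy good enough
    for s in P:                                    # loop for s in S
        Q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
             for a in P[s]}                        # loop for a in A
        V[s] = max(Q.values())                     # V(s) <- max_a Q(s, a)
print(V)                                           # converges toward V*(s)
```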

SLIDE 24

Q function

Define a new function, closely related to $V^*$:

$$Q(s, a) \equiv E[r(s, a)] + \gamma\, E_{s' \mid s, a}\big[V^*(s')\big]$$

so that

$$V^*(s) = \max_a Q(s, a), \qquad \pi^*(s) = \arg\max_a Q(s, a)$$

If the agent knows $Q(s, a)$, it can choose the optimal action without knowing $P(s' \mid s, a)$, and it can learn $Q(s, a)$ without knowing $P(s' \mid s, a)$.

SLIDE 25

Q learning for deterministic worlds

for each $s, a$ initialize table entry $\hat{Q}(s, a) \leftarrow 0$
observe current state $s$
do forever {
    select an action $a$ and execute it
    receive immediate reward $r$
    observe the new state $s'$
    update table entry: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
    $s \leftarrow s'$
}
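A sketch of this algorithm on a toy deterministic chain (the environment, the episode resets, and the random action selection are my additions):

```python
import random
from collections import defaultdict

def step(s, a):
    """Deterministic chain 0..3; reward 1 only for entering state 3."""
    s2 = min(max(s + a, 0), 3)
    return s2, (1.0 if s2 == 3 else 0.0)

gamma = 0.9
Q = defaultdict(float)                  # initialize every table entry Q-hat(s, a) to 0
s = 0                                   # observe current state s
for _ in range(2000):                   # "do forever", truncated
    a = random.choice([-1, 1])          # select an action a and execute it
    s2, r = step(s, a)                  # receive reward r, observe the new state s'
    Q[(s, a)] = r + gamma * max(Q[(s2, -1)], Q[(s2, 1)])  # update table entry
    s = s2 if s2 != 3 else 0            # s <- s' (restart from state 0 after the goal)
print({k: round(v, 2) for k, v in sorted(Q.items())})
```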

SLIDE 26

Example question

SLIDE 27

Example question