Neural networks and Reinforcement learning review
CS 540 Yingyu Liang
Neural Networks Outline
▪ Building unit: neuron
▪ Linear perceptron
▪ Non-linear perceptron
▪ The power/limit of a single perceptron
▪ Learning of a single perceptron
Perceptron decision rule: output 1 if $\sum_i x_i y_i \ge 1$, and 0 otherwise (weights $x_i$, inputs $y_i$).
Example inputs: Weather conditions, Company, Proximity.
All inputs are binary; 1 is favorable.
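To make the rule concrete, here is a minimal Python sketch of this perceptron; the function name and the weight values are hypothetical, chosen so that any two favorable inputs out of three clear the threshold.

```python
def perceptron(inputs, weights, threshold=1.0):
    """Linear perceptron: output 1 iff the weighted sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Hypothetical weights for the binary inputs (Weather, Company, Proximity).
weights = [0.5, 0.5, 0.5]
print(perceptron([1, 1, 0], weights))  # 1: two favorable inputs reach the threshold
print(perceptron([1, 0, 0], weights))  # 0: one favorable input is not enough
```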
Multi-layer neural networks
▪ One output unit per class, with classes encoded as one-hot vectors: class1 = (1,0,0,…,0), class2 = (0,1,0,…,0), etc.
[Figure: a network with inputs $y_1, y_2$, a hidden layer of three units $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$, and output units $a_1, \dots, a_L$.]

Hidden layer: $a_j^{(2)} = g\Big(\sum_i y_i\, x_{ji}^{(2)}\Big)$ for $j = 1, 2, 3$

Output layer: $a_j = g\Big(\sum_i a_i^{(2)}\, x_{ji}^{(3)}\Big)$ for $j = 1, \dots, L$

where $x_{ji}^{(l)}$ is the weight into the $j$-th neuron of layer $l$ and $g$ is the activation function.
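The forward pass above can be written in a few lines of NumPy. This is a sketch under assumptions: a sigmoid is used for $g$, the layer sizes (2 inputs, 3 hidden units, $L = 2$ outputs) mirror the figure, and all weight values are made up.

```python
import numpy as np

def g(z):
    """Sigmoid activation, one common choice for g."""
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([1.0, 0.0])            # inputs y_1, y_2
X2 = np.array([[ 0.1, -0.2],        # X2[j, i] = x_{ji}^{(2)}, input -> hidden weights
               [ 0.4,  0.3],
               [-0.5,  0.2]])
X3 = np.array([[ 0.2, -0.1,  0.3],  # X3[j, i] = x_{ji}^{(3)}, hidden -> output weights
               [-0.4,  0.5,  0.1]])

a2 = g(X2 @ y)    # hidden layer: a_j^{(2)} = g(sum_i y_i x_{ji}^{(2)})
a = g(X3 @ a2)    # output layer: a_j = g(sum_i a_i^{(2)} x_{ji}^{(3)})
print(a)          # network outputs a_1, ..., a_L
```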
Learning in neural networks
Training error over the data set $D$: $E = \frac{1}{2}\sum_{d \in D} E_d$, where for an example $d$ with target $z = (z_1, \dots, z_L)$ and network output $a = (a_1, \dots, a_L)$:

$E_d = \lVert z - a \rVert^2 = \sum_{j=1}^{L} (a_j - z_j)^2$

[Figure: the network with inputs $y_1, y_2$ and outputs $a_1, \dots, a_L$ compared against the target $z$.]
Backpropagation
[Figure: a four-layer network, Layer (1) inputs through Layer (4) outputs $a_1, a_2$, with error $E_d = \lVert z - a \rVert^2$.]

By the chain rule, the gradient with respect to an output-layer weight factors as

$\dfrac{\partial E_d}{\partial x_{11}^{(4)}} = \delta_1^{(4)}\, a_1^{(3)}$

where $\delta_1^{(4)} = \dfrac{\partial E_d}{\partial A_1^{(4)}} = 2(a_1 - z_1)\, g'\big(A_1^{(4)}\big)$

and $A_1^{(4)}$ denotes the linear combination input to neuron 1 in Layer (4).
Backpropagation of δ

[Figure: the same four-layer network, now labeling the $\delta_j^{(l)}$ of each neuron, from $\delta_1^{(4)}, \delta_2^{(4)}$ back through $\delta_1^{(3)}, \delta_2^{(3)}$ to $\delta_1^{(2)}, \delta_2^{(2)}$, with $E_d = \lVert z - a \rVert^2$.]
Thus, for any neuron in the network:

$\delta_j^{(l)} = \Big(\sum_k \delta_k^{(l+1)}\, x_{kj}^{(l+1)}\Big)\, g'\big(A_j^{(l)}\big)$

$\delta_j^{(l)}$: the δ of the $j$-th neuron in Layer $l$
$\delta_k^{(l+1)}$: the δ of the $k$-th neuron in Layer $l+1$
$g'\big(A_j^{(l)}\big)$: derivative of the $j$-th neuron in Layer $l$ with respect to its linear combination input
$x_{kj}^{(l+1)}$: weight from the $j$-th neuron in Layer $l$ to the $k$-th neuron in Layer $l+1$
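Combining the two rules gives one backward pass. The sketch below assumes the same small sigmoid network as the forward-pass example, with random illustrative weights and a hypothetical learning rate; it computes the output-layer and hidden-layer deltas and takes one gradient step on $E_d$.

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)            # derivative of the sigmoid w.r.t. its net input

rng = np.random.default_rng(0)
X2 = rng.normal(size=(3, 2))        # input -> hidden weights x_{ji}^{(2)}
X3 = rng.normal(size=(2, 3))        # hidden -> output weights x_{ji}^{(3)}
y = np.array([1.0, 0.0])            # input
z = np.array([1.0, 0.0])            # one-hot target

# Forward pass, keeping the linear combinations A_j for use in g'.
A2 = X2 @ y;  a2 = g(A2)
A3 = X3 @ a2; a = g(A3)

# Output layer: delta_j = 2 (a_j - z_j) g'(A_j).
d3 = 2.0 * (a - z) * g_prime(A3)
# Hidden layer: delta_j = (sum_k delta_k x_{kj}) g'(A_j).
d2 = (X3.T @ d3) * g_prime(A2)

# Gradient step, using dE_d/dx_{ji} = delta_j a_i with a hypothetical learning rate.
lr = 0.1
X3 -= lr * np.outer(d3, a2)
X2 -= lr * np.outer(d2, y)
```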
1-D convolution:

$t_u = \sum_{i=-\infty}^{+\infty} v_i\, x_{u-i}$, written $t = v * x$ (so $t_u = (v * x)_u$)
[Figure: convolving $v = [a, b, c, d, e, f]$ with the filter $x = [z, y, x]$; because convolution flips the filter, the window over $(b, c, d)$ produces $t_3 = xb + yc + zd$.]

[Figure: max pooling of $v = [a, b, c, d, e, f]$ with a width-3 window, producing outputs such as $\mathrm{Max}(b, c, d)$.]

Quiz: with $v = [1, 2, 3, 4, 5, 6]$ and $x = [-1, 1, 1]$, what is the value $t = v * x$? (Valid padding)
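As a check on the quiz, NumPy's convolve implements exactly this flipped-filter convolution; mode='valid' keeps only the positions where the filter fully overlaps $v$. The max-pooling line is an extra illustration matching the figure above.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5, 6])
x = np.array([-1, 1, 1])

# t = v * x with valid padding: the filter is flipped, so each output element
# is 1*v[u] + 1*v[u+1] + (-1)*v[u+2].
t = np.convolve(v, x, mode='valid')
print(t)  # [0 1 2 3]

# Width-3 max pooling, for comparison with the Max(b, c, d) figure:
print([v[i:i + 3].max() for i in range(len(v) - 2)])  # [3, 4, 5, 6]
```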
[Figure: the agent–environment loop: the agent observes a state, takes an action, and receives a reward from the environment.]

$s_0 \xrightarrow{a_0,\, r_0} s_1 \xrightarrow{a_1,\, r_1} s_2 \xrightarrow{a_2,\, r_2} \cdots$
Goal: learn a policy $\pi : S \to A$ for choosing actions that maximizes $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$ for every possible starting state $s_0$.
Markov assumption:

$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t)$
$P(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(r_{t+1} \mid s_t, a_t)$
Value of a policy:

$V^{\pi}(s_t) = E[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \,]$, where $0 \le \gamma < 1$,

assuming the action sequence is chosen according to $\pi$ starting at state $s_t$.
$\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s$; we'll denote the value function for this optimal policy as $V^*(s)$.
$V^{\pi}(s) = E\Big[\sum_t \gamma^t r_t\Big]$
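A quick worked example of this discounted sum for one trajectory, with a made-up reward sequence and $\gamma = 0.9$:

```python
gamma = 0.9                      # discount factor, 0 <= gamma < 1
rewards = [1.0, 0.0, 2.0, 1.0]   # hypothetical rewards r_0, r_1, r_2, r_3

# Discounted return: sum_t gamma^t r_t.
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
print(ret)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```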
Value iteration:

initialize V(s) arbitrarily
loop until policy good enough {
    loop for s ∈ S {
        loop for a ∈ A {
            Q(s, a) ← r(s, a) + γ Σ_{s' ∈ S} P(s' | s, a) V(s')
        }
        V(s) ← max_a Q(s, a)
    }
}
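Below is a minimal NumPy sketch of this loop on a made-up 2-state, 2-action MDP; the transition tensor P, the rewards R, and the convergence test standing in for "until policy good enough" are all illustrative assumptions.

```python
import numpy as np

# Hypothetical MDP: P[s, a, s2] = P(s2 | s, a), R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)                         # initialize V(s) arbitrarily
for _ in range(1000):                   # loop "until policy good enough"
    Q = R + gamma * (P @ V)             # Q(s,a) = r(s,a) + gamma sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)               # V(s) = max_a Q(s,a)
    if np.abs(V_new - V).max() < 1e-8:  # assumed convergence test
        break
    V = V_new

print(V, Q.argmax(axis=1))              # optimal values and a greedy policy
```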
Define a new function Q, closely related to $V^*$:
▪ if the agent knows $Q(s, a)$, it can choose the optimal action without knowing $P(s' \mid s, a)$
▪ and it can learn $Q(s, a)$ without knowing $P(s' \mid s, a)$
$Q(s, a) \equiv E[\, r(s, a) \,] + \gamma\, E_{s' \mid s, a}[\, V^*(s') \,]$

$V^*(s) = \max_a Q(s, a)$

$\pi^*(s) = \arg\max_a Q(s, a)$
Q-learning:

for each s, a initialize table entry Q̂(s, a) ← 0
do forever {
    select an action a and execute it
    receive immediate reward r
    update table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
    s ← s'
}
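Finally, a minimal tabular sketch of this algorithm. The update as written carries no learning rate, which is appropriate for a deterministic world, so the sketch assumes a made-up deterministic 2-state, 2-action environment; the uniform-random action selection and the finite loop are also illustrative assumptions.

```python
import numpy as np

# Hypothetical deterministic world: next_state[s][a] and reward[s][a].
next_state = [[0, 1], [0, 1]]
reward = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

Q = np.zeros((2, 2))        # initialize each table entry Q_hat(s, a) to 0
rng = np.random.default_rng(0)

s = 0
for _ in range(5000):       # "do forever", truncated for the sketch
    a = int(rng.integers(2))                # select an action a and execute it
    r, s2 = reward[s][a], next_state[s][a]  # receive immediate reward r, observe s'
    Q[s, a] = r + gamma * Q[s2].max()       # Q_hat(s,a) <- r + gamma max_a' Q_hat(s',a')
    s = s2                                  # s <- s'

print(Q)                    # learned table Q_hat
```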