Deep learning
Deep reinforcement learning
Hamid Beigy
Sharif university of technology
December 25, 2018
Hamid Beigy | Sharif university of technology | December 25, 2018 1 / 65
Deep learning | Introduction
[Figure: reinforcement learning at the intersection of computer science (machine learning), engineering (optimal control), mathematics (operations research), economics (rationality/game theory), psychology (classical/operant conditioning), and neuroscience (reward system).]
[Figure: tree of moves from a starting position; surviving node labels: b, c, c*, d, e, e*, g.]
[Figure: agent-environment interaction. Given state $s_t$ and reward $r_t$, the agent sends action $a_t$ to the environment, which returns reward $r_{t+1}$ and state $s_{t+1}$.]
Deep learning | Non-associative reinforcement learning
$\Pr\{a_t = a\} = \dfrac{\exp(Q_t(a)/\tau)}{\sum_{b} \exp(Q_t(b)/\tau)}$
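The term exp(Q_t(b)/τ) is the unnormalized weight used in softmax (Boltzmann) action selection. The following is a minimal Python sketch, assuming the standard form in which action a is chosen with probability proportional to exp(Q_t(a)/τ). The function names and the list-of-floats representation of the action values are illustrative assumptions, not taken from the slides.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Return Boltzmann action probabilities for a list of action values."""
    if tau <= 0:
        raise ValueError("temperature tau must be positive")
    # Subtracting the maximum before exponentiating avoids overflow
    # and leaves the normalized probabilities unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, tau, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A large temperature τ makes the choice nearly uniform, while a small τ makes it nearly greedy with respect to the current estimates.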
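Environment:
1. $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_r\}$ is the set of inputs,
2. $\beta = \{0, 1\}$ is the set of values that the reinforcement signal can take,
3. $C = \{c_1, c_2, \dots, c_r\}$ is the set of penalty probabilities, where $c_i$ is the penalty probability of action $\alpha_i$.

Learning automaton:
1. $\beta = \{0, 1\}$ is the set of inputs,
2. $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_r\}$ is the set of actions,
3. $T$ is the learning algorithm used to modify the action probability vector $p$.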
$p_j(k+1) = \dfrac{b}{r-1} + p_j(k)(1 - b), \quad j \neq i$ (penalty response to action $\alpha_i$)
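The expression b/(r−1) + p_j(k)(1 − b) has the form of the penalty step in a linear reward-penalty learning automaton. The sketch below assumes the standard linear reward-penalty scheme; the reward step size `a`, the function name, and the argument layout are assumptions for illustration and are not recoverable from the slides. It requires at least two actions (r ≥ 2).

```python
def lrp_update(p, chosen, beta, a, b):
    """Apply one linear reward-penalty step to the probability vector p.

    p      : list of action probabilities (sums to 1, length r >= 2)
    chosen : index i of the action that was just taken
    beta   : reinforcement signal, 0 = reward, 1 = penalty
    a, b   : reward and penalty step sizes in [0, 1]
    """
    r = len(p)
    new_p = []
    for j, pj in enumerate(p):
        if beta == 0:
            # Reward: move probability mass toward the chosen action.
            new_p.append(pj + a * (1 - pj) if j == chosen else (1 - a) * pj)
        else:
            # Penalty: shrink the chosen action and spread the mass evenly
            # over the other r - 1 actions.
            new_p.append((1 - b) * pj if j == chosen
                         else b / (r - 1) + (1 - b) * pj)
    return new_p
```

Both branches keep the vector normalized: in the penalty case the chosen action loses b·p_i and each other action gains b/(r−1) − b·p_j, and these changes sum to zero.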
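$M(k) = \sum_{i=1}^{r} c_i \, p_i(k)$
$\lim_{k \to \infty} E[M(k)] < M(0)$ (expedient)
$\lim_{k \to \infty} E[M(k)] = \min_i c_i$ (optimal)
$\lim_{k \to \infty} E[M(k)] < \min_i c_i + \epsilon$ ($\epsilon$-optimal)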
Deep learning | Associative reinforcement learning
Deep learning | Goals, rewards, and returns
$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$
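The return is an infinite sum over k starting at k = 0. For a finite reward sequence, the discounted version of that sum can be computed as in the sketch below. The discount factor `gamma` and the convention that `rewards[k]` holds the reward received k + 1 steps after time t are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Return the sum of gamma**k * rewards[k] over a finite sequence."""
    g = 0.0
    # Accumulate from the last reward backwards, so each step needs
    # only one multiplication and one addition.
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three rewards of 1 with γ = 0.5 give 1 + 0.5 + 0.25 = 1.75.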
Deep learning | Markov decision process
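$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$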
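[Figure: transition graph of an MDP with states high and low and actions search, wait, and recharge. Surviving edge labels (transition probability, reward): $(1, 0)$; $(1-\beta, -3)$; $(1-\alpha, R^{search})$; $(\beta, R^{search})$; $(\alpha, R^{search})$; $(1, R^{wait})$; $(1, R^{wait})$.]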
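$V^\pi(s) = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\} = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
$Q^\pi(s, a) = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right\}$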
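The recursive form ending in γV^π(s′) can be turned into an iterative procedure that repeatedly applies the backup until the values stop changing. The sketch below is one way to do this under stated assumptions: the model layout `P[s][a]` as a list of `(probability, next_state, reward)` triples, the stochastic policy layout `policy[s][a]`, and the stopping threshold `theta` are all hypothetical choices for illustration.

```python
def policy_evaluation(P, policy, gamma, theta=1e-10):
    """Estimate V^pi by repeated expected one-step backups.

    P[s][a]      : list of (probability, next_state, reward) triples
    policy[s][a] : probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}
    while True:
        # Synchronous sweep: every new value is computed from the old V.
        new_V = {
            s: sum(
                policy[s][a]
                * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < theta:
            return V
```

The loop terminates for γ < 1 because each sweep shrinks the largest error by at least a factor of γ.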
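$V^*(s) = \max_\pi V^\pi(s)$
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$
[Figure: backup diagrams for (a) $V^*$ and (b) $Q^*$, each with a max over actions.]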
Deep learning | Model based methods
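$V^*(s) = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \} = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]$
$Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \} = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$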
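$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$
Policy evaluation (E):
$V^\pi(s) = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\} = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$
Policy improvement (I):
$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]$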
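$V_{k+1}(s) = \max_a E\{ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a \} = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]$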
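The update ending in γV_k(s′) is the value iteration backup: each sweep replaces the value of a state with the best one-step lookahead under the current estimate. A minimal Python sketch follows, assuming a model stored as `P[s][a]`, a list of `(probability, next_state, reward)` triples. That layout, the threshold `theta`, and the two-state `toy_mdp` are hypothetical and serve only to exercise the functions.

```python
def value_iteration(P, gamma, theta=1e-8):
    """Return state values from repeated Bellman optimality backups.

    P[s][a] is a list of (probability, next_state, reward) triples.
    """
    V = {s: 0.0 for s in P}
    while True:
        # Synchronous sweep: V_{k+1} is computed entirely from V_k.
        new_V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            for s in P
        }
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < theta:
            return V


def greedy_policy(P, V, gamma):
    """Pick, in each state, the action with the best one-step lookahead."""
    def lookahead(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
    return {s: max(P[s], key=lambda a, s=s: lookahead(s, a)) for s in P}


# Hypothetical two-state MDP used only to exercise the functions.
toy_mdp = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
}
values = value_iteration(toy_mdp, gamma=0.5)
policy = greedy_policy(toy_mdp, values, gamma=0.5)
```

With γ = 0.5 the values approach V(B) = 2 / (1 − 0.5) = 4 and V(A) = 1 + 0.5 · 4 = 3, and the greedy policy chooses "go" in state A.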
[Figure: backup diagram, with terminal states marked T.]
Deep learning | Value-based methods
Deep learning | Value-based methods | Monte Carlo methods
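$S_1 \xrightarrow{\alpha_1} R_1, S_2 \xrightarrow{\alpha_2} R_2, S_3 \xrightarrow{\alpha_3} R_3, S_4 \;\dots\; \xrightarrow{\alpha_{k-1}} R_{k-1}, S_k$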
[Figure: Monte Carlo backup diagram, with terminal states marked T.]
Deep learning | Value-based methods | Temporal-difference methods
[Figure: temporal-difference backup diagram, with terminal states marked T.]
Deep learning | Policy-based methods
Deep learning | Deep reinforcement learning
[Figure: agent-environment loop with state $s_t$, reward $r_t$, and action $a_t$.]
Deep learning | Value-Based Deep RL
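[Figure: two Q-network architectures with weights $w$. In one, the network takes state $s$ and action $a$ as input and outputs $Q(s, a, w)$. In the other, the network takes only state $s$ and outputs $Q(s, a_1, w), \dots, Q(s, a_m, w)$, one value per action.]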
1 Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al.
Deep learning | Policy-Based Deep RL
Deep learning | AlphaGo
2 Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search."
3 Silver, David, et al. "Mastering the game of Go without human knowledge." Nature
Credit: Silver (IJCAI 2017)
Deep learning | Reading