SLIDE 1
Variational Deep Q-Networks in Edward
Harri Bell-Thomas
R244: Open Source Project Presentation
19/11/2019
SLIDE 2
Q-Learning
Q-Learning is model-free reinforcement learning. Q is the action-value function defining the reward used for reinforcement.
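As a concrete illustration (my own sketch, not from the slides), a minimal tabular Q-learning loop in Python; the Gym-style environment interface (reset, step, action_space) and the hyperparameter values are assumptions.

    import numpy as np

    # Minimal tabular Q-learning sketch (illustrative; assumes a Gym-style
    # discrete environment with reset(), step(a), and action_space.sample()).
    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.99, eps=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy: explore with probability eps, else act greedily
                a = env.action_space.sample() if np.random.rand() < eps \
                    else int(Q[s].argmax())
                s_next, r, done, _ = env.step(a)
                # Move Q(s, a) toward the bootstrapped target r + gamma max_a' Q(s', a')
                Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
                s = s_next
        return Q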
SLIDE 3
Q-Learning: Bellman Error
The Bellman equation expresses the value of Qπ at a point in time, t, in terms of the payoff from an initial choice, at, and the value of the remaining decision problem that results after that choice. The Bellman error measures how far Q deviates from satisfying it:
J(\pi) = \mathbb{E}\!\left[\left(Q^\pi(s_t, a_t) - \max_a \mathbb{E}\left[r_t + \gamma\, Q^\pi(s_{t+1}, a)\right]\right)^2\right], \qquad s_t \sim \rho, \; a_t \sim \pi(\cdot \mid s_t)
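For intuition (a numeric example of my own, not from the slides): with γ = 0.9, Qπ(st, at) = 2.0, observed reward rt = 1.0, and max_a Qπ(st+1, a) = 1.5, the target is 1.0 + 0.9 × 1.5 = 2.35, so this transition contributes a squared error of (2.0 − 2.35)² ≈ 0.12 to J(π).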
SLIDE 4
Deep Q-Networks
Briefly: approximate the action-value function Qπ(s, a) with a neural network Qθ(s, a); the (greedy) policy this represents is πθ. Discretise the expectation using K sample trajectories, each with period T, and use these to approximate J(θ).
\tilde{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=1}^{T} \left( Q^{(i)}_\theta\!\left(s^{(i)}_t, a^{(i)}_t\right) - \max_a \left[ r_t + \gamma\, Q^{(i)}_\theta\!\left(s^{(i)}_{t+1}, a\right) \right] \right)^2
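A minimal NumPy sketch of this discretised objective (my own illustration; q stands in for the network Qθ as a callable returning per-action values, and trajectories is assumed to be a list of K lists of (s, a, r, s_next) tuples):

    import numpy as np

    # Empirical DQN objective ~J(theta): mean squared Bellman error over
    # K sampled trajectories with T transitions each. `q` is any callable
    # mapping a state to a vector of per-action values (stand-in for Q_theta).
    def empirical_objective(q, trajectories, gamma=0.99):
        errors = []
        for traj in trajectories:              # K trajectories
            for (s, a, r, s_next) in traj:     # T transitions per trajectory
                target = r + gamma * np.max(q(s_next))   # max_a [r + gamma Q(s', a)]
                errors.append((q(s)[a] - target) ** 2)
        return np.mean(errors)                 # averages the 1/(KT) sum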
SLIDE 5
Variational Inference
Main Concepts:
- 1. Try to solve an optimisation problem over a class of tractable distributions, q, parameterised by φ, in order to find the one most similar to p.
- 2. \phi^* = \arg\min_\phi \, \mathrm{KL}\!\left( q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}) \right)
- 3. Approximate this using gradient descent; a toy sketch follows below.
SLIDE 6
Variational Deep Q-Networks
Idea: For efficient exploration we need qφ(θ) to be dispersed — near even coverage of the parameter space. Encourage this by adding an entropy bonus to the objective.
\mathbb{E}_{\theta \sim q_\phi(\theta)}\!\left[\left( Q_\theta(s_j, a_j) - \max_{a'} \mathbb{E}\left[ r_j + \gamma\, Q_\theta(s'_j, a') \right] \right)^2\right] - \lambda\, H(q_\phi(\theta))
Assigning systematic randomness to Q enables efficient exploration of the policy space. Further, encouraging high entropy over the parameter distribution prevents premature convergence. tl;dr: a higher chance of finding maximal rewards, in less time than with standard DQNs.
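An illustrative PyTorch rendering of this objective (my own sketch, not the project's Edward implementation): θ is drawn from a diagonal-Gaussian qφ over the weights of a small linear Q-function, and λ·H(qφ) is subtracted from the Bellman loss. The linear Q-function, the batch format, and all hyperparameters are assumptions.

    import torch

    # One stochastic evaluation of the VDQN-style objective: squared Bellman
    # error under theta ~ q_phi, minus an entropy bonus lambda * H(q_phi).
    # `batch` = (s, a, r, s_next) tensors of shapes (B, d), (B,), (B,), (B, d).
    def vdqn_loss(mu, log_sigma, batch, n_actions, gamma=0.99, lam=0.01):
        sigma = log_sigma.exp()
        theta = mu + sigma * torch.randn_like(mu)     # one sample from q_phi
        W = theta.view(n_actions, -1)                 # weights of a linear Q
        q = lambda states: states @ W.t()             # Q_theta(s, .) per state
        s, a, r, s_next = batch
        target = (r + gamma * q(s_next).max(dim=1).values).detach()
        bellman = ((q(s).gather(1, a[:, None]).squeeze(1) - target) ** 2).mean()
        entropy = torch.distributions.Normal(mu, sigma).entropy().sum()
        return bellman - lam * entropy                # minimise over (mu, log_sigma)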
SLIDE 7
Algorithm
Figure: VDQN Pseudocode.
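The pseudocode figure itself does not survive extraction. In outline (a hedged reconstruction following the objective on the previous slide, not the lost figure): each episode, sample θ ∼ qφ(θ) and act greedily with respect to Qθ, storing transitions in a replay buffer; after each step, take a gradient step on the entropy-regularised Bellman objective with respect to φ; repeat.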
SLIDE 8
Aim / Goals
Workplan
SLIDE 9