SLIDE 1

Variational Deep Q-Networks in Edward

Harri Bell-Thomas

R244: Open Source Project Presentation

19/11/2019

SLIDE 2

Q-Learning

Q-Learning is model-free reinforcement learning. Q is the action-value function giving the expected return of taking an action in a state; this is what is learned. Conceptually,

$$Q^{\pi}(s, a) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\left[\, \sum_{t=0}^{\infty} r_t \gamma^{t} \;\middle|\; s_0 = s,\ a_0 = a \right]$$
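
As a concrete (and entirely illustrative) instance of this definition, a minimal tabular Q-learning loop on a toy 5-state chain with an ε-greedy behaviour policy; neither the environment nor the hyperparameters come from the slides:

```python
import numpy as np

# Tabular Q-learning on a toy 5-state chain (illustrative; the slide
# does not fix an environment). Action 1 moves right, action 0 left;
# reaching the final state pays reward 1.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy behaviour policy
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```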

SLIDE 3

Q-Learning: Bellman Error

The Bellman equation gives the value of Q^π at a point in time, t, in terms of the payoff from an initial choice, a_t, and the value of the remaining decision problem that results after that choice. The Bellman error J(π) measures the deviation from this self-consistency:

$$J(\pi) = \mathbb{E}\left[ \left( Q^{\pi}(s_t, a_t) - \max_{a} \mathbb{E}\left[ r_t + \gamma Q^{\pi}(s_{t+1}, a) \right] \right)^{2} \right], \qquad s_t \sim \rho,\ a_t \sim \pi(\cdot \mid s_t)$$
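
In code, a one-sample estimate of this error over a batch of observed transitions might look like the sketch below; the tabular Q and the batch layout are illustrative assumptions, not anything fixed by the slides:

```python
import numpy as np

def bellman_error(Q, transitions, gamma=0.99):
    """Mean squared Bellman error over (s, a, r, s_next) transitions.

    The inner expectation is replaced by its one-sample estimate from
    the observed transition, so max_a moves onto Q(s_next, a) directly,
    as in the DQN target on the next slide.
    """
    errors = [
        (Q[s, a] - (r + gamma * np.max(Q[s_next]))) ** 2
        for (s, a, r, s_next) in transitions
    ]
    return float(np.mean(errors))
```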

SLIDE 4

Deep Q-Networks

Briefly: approximate the action-value function Q^π(s, a) with a neural network Q_θ(s, a); the (greedy) policy this represents is π_θ. Discretise the expectation using K sampled trajectories, each of length T, and use this to approximate J(θ):

$$\tilde{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=1}^{T} \left( Q_{\theta}\big(s_t^{(i)}, a_t^{(i)}\big) - \max_{a} \left[ r_t^{(i)} + \gamma\, Q_{\theta}\big(s_{t+1}^{(i)}, a\big) \right] \right)^{2}$$
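
A direct transcription of J̃(θ) might look like the sketch below, with a linear Q_θ standing in for the neural network so the example is self-contained; the linear form, the feature vectors, and the data layout are all assumptions:

```python
import numpy as np

def q_theta(theta, s):
    # Linear stand-in for the Q-network: theta has shape (n_actions, d),
    # s is a d-dimensional state feature vector.
    return theta @ s

def j_tilde(theta, trajectories, gamma=0.99):
    """Monte-Carlo estimate of J(theta) from K trajectories of length T.

    trajectories: K lists of T transitions (s, a, r, s_next).
    """
    K, T = len(trajectories), len(trajectories[0])
    total = 0.0
    for traj in trajectories:
        for (s, a, r, s_next) in traj:
            target = r + gamma * np.max(q_theta(theta, s_next))
            total += (q_theta(theta, s)[a] - target) ** 2
    return total / (K * T)
```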

SLIDE 5

Variational Inference

Main Concepts:

  • 1. Solve an optimisation problem over a class of tractable distributions, q_φ, parameterised by φ, to find the member most similar to the target posterior p.

  • 2. $\phi^{*} = \arg\min_{\phi} \mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D})\big)$

  • 3. Approximate this minimisation using stochastic gradient descent, as in the Edward sketch below.
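
Since the project targets Edward, a minimal sketch of this recipe using Edward's ed.KLqp is given below. The Bayesian linear-regression model, the toy data, and the variable names are illustrative stand-ins, not the project's actual model:

```python
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal

# Toy data for an illustrative Bayesian linear regression.
x_train = np.random.randn(50, 1).astype(np.float32)
y_train = 2.0 * x_train[:, 0] + 0.1 * np.random.randn(50).astype(np.float32)

X = tf.placeholder(tf.float32, [50, 1])
w = Normal(loc=tf.zeros(1), scale=tf.ones(1))           # prior p(theta)
y = Normal(loc=ed.dot(X, w), scale=0.1 * tf.ones(50))   # likelihood p(D | theta)

# q_phi(theta): tractable family, parameterised by phi = (loc, scale).
qw = Normal(loc=tf.get_variable("qw_loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qw_scale", [1])))

# KLqp minimises KL(q_phi(theta) || p(theta | D)) by stochastic
# gradient descent on the ELBO, i.e. steps 2 and 3 above.
inference = ed.KLqp({w: qw}, data={X: x_train, y: y_train})
inference.run(n_iter=500)
```

KLqp estimates the gradient with reparameterised samples from q_φ, which is the same machinery a VDQN reuses when q_φ is placed over Q-network weights.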
SLIDE 6

Variational Deep Q-Networks

Idea: for efficient exploration we need q_φ(θ) to be dispersed, with near-even coverage of the parameter space. Encourage this by adding an entropy bonus to the objective:

$$\mathbb{E}_{\theta \sim q_\phi(\theta)}\left[ \left( Q_{\theta}(s_j, a_j) - \max_{a'} \mathbb{E}\left[ r_j + \gamma Q_{\theta}(s'_j, a') \right] \right)^{2} \right] - \lambda H\big(q_\phi(\theta)\big)$$

Assigning systematic randomness to Q enables efficient exploration of the policy space. Further, encouraging high entropy over the parameter distribution prevents premature convergence. tl;dr: a higher chance of finding maximal rewards, and in less time, than standard DQNs.
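
As a sketch of how the pieces compose, assuming a factorised Gaussian q_φ(θ) = N(μ, diag(σ²)), whose entropy has a closed form, and any Bellman-error function such as the one sketched after slide 3; all names here are illustrative:

```python
import numpy as np

def gaussian_entropy(sigma):
    # Closed-form entropy of N(mu, diag(sigma^2)), summed over dimensions.
    return float(np.sum(0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)))

def vdqn_objective(mu, sigma, bellman_error, batch, lam=0.01, n_samples=8):
    """E_{theta ~ q_phi}[Bellman error] - lambda * H(q_phi).

    The outer expectation is estimated with reparameterised samples
    theta = mu + sigma * eps; `bellman_error` is any function of
    (theta, batch).
    """
    rng = np.random.default_rng(0)
    samples = [
        bellman_error(mu + sigma * rng.standard_normal(mu.shape), batch)
        for _ in range(n_samples)
    ]
    return float(np.mean(samples)) - lam * gaussian_entropy(sigma)
```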

SLIDE 7

Algorithm

Figure: VDQN Pseudocode.
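
The pseudocode image itself does not survive this transcript. As a placeholder, below is a hedged, self-contained reconstruction of the loop in Python: a factorised Gaussian q_φ over the weights of a linear Q on a toy chain stands in for the Bayesian neural network, with a DQN-style semi-gradient update. Every concrete detail here is an assumption, not the author's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
mu = np.zeros((n_actions, n_states))      # phi: means of q_phi(theta)
sigma = np.ones((n_actions, n_states))    # phi: scales of q_phi(theta)
lr, gamma, lam = 0.05, 0.99, 0.01

def feat(s):
    v = np.zeros(n_states)                # one-hot state features
    v[s] = 1.0
    return v

def env_step(s, a):
    # Toy chain: action 1 moves right, 0 moves left; goal state pays 1.
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(300):
    eps = rng.standard_normal(mu.shape)
    theta = mu + sigma * eps              # theta ~ q_phi, reparameterised
    s, transitions = 0, []
    for t in range(20):
        a = int(np.argmax(theta @ feat(s)))   # greedy w.r.t. the sampled Q_theta:
        s2, r = env_step(s, a)                # exploration comes from q_phi itself
        transitions.append((s, a, r, s2))
        s = s2
    # One SGD step on E_q[Bellman error] - lam * H(q_phi), with the
    # bootstrapped target held constant (DQN-style semi-gradient).
    g = np.zeros_like(mu)
    for (s, a, r, s2) in transitions:
        target = r + gamma * np.max(theta @ feat(s2))
        delta = float((theta @ feat(s))[a] - target)
        g[a] += 2.0 * delta * feat(s) / len(transitions)
    mu -= lr * g                          # dL/dmu    = dL/dtheta
    sigma -= lr * (g * eps - lam / sigma) # dL/dsigma = dL/dtheta * eps; dH/dsigma = 1/sigma
    sigma = np.maximum(sigma, 1e-3)       # keep scales positive
```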

SLIDE 8

Aim / Goals / Workplan

SLIDE 9

Questions?