  1. Variational Deep Q-Networks in Edward. Harri Bell-Thomas. R244: Open Source Project Presentation, 19/11/2019.

  2. Q-Learning. Q-Learning is model-free reinforcement learning. Q is the action-value function defining the reward used for reinforcement; this is what is learned. Conceptually,
     Q^\pi(s, a) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a \right]
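A minimal sketch of the underlying idea, using a tabular Q-learning update; the toy environment, its dynamics, and the hyperparameters below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Tabular Q-learning sketch (illustrative only).
# Q[s, a] approximates Q^pi(s, a); gamma is the discount, alpha the learning rate.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
gamma, alpha, epsilon = 0.99, 0.1, 0.1

def step(state, action):
    """Placeholder environment dynamics (assumption): returns next_state, reward, done."""
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(200):
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q[s, a] towards the bootstrapped target r + gamma * max_a' Q[s', a']
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
```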

  3. Q-Learning: Bellman Error. The Bellman equation expresses the value of Q^\pi at a certain point in time, t, in terms of the payoff from an initial choice, a_t, and the value of the remaining decision problem that results after that choice. The Bellman error measures how far Q^\pi is from satisfying it:
     J(\pi) = \mathbb{E}_{s_0 \sim \rho,\ a_t \sim \pi(\cdot \mid s_t)} \left[ \left( Q^\pi(s_t, a_t) - \max_{a} \mathbb{E}\left[ r_t + \gamma Q^\pi(s_{t+1}, a) \right] \right)^2 \right]
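A tiny worked example of the inner term, with numbers invented purely for illustration: the Bellman target and the squared residual for a single transition (s, a, r, s').

```python
# Made-up numbers for one transition (s, a, r, s').
gamma = 0.9
q_s_a = 2.0                        # current estimate Q^pi(s_t, a_t)
r = 1.0                            # observed reward r_t
q_s_next = [1.5, 3.0, 0.5]         # Q^pi(s_{t+1}, a) for each action a

target = r + gamma * max(q_s_next)               # 1.0 + 0.9 * 3.0 = 3.7
squared_bellman_error = (q_s_a - target) ** 2    # (2.0 - 3.7)^2 = 2.89
print(target, squared_bellman_error)
```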

  4. Deep Q-Networks, Briefly. Approximate the action-value function Q^\pi(s, a) with a neural network Q_\theta(s, a). The (greedy) policy this represents is \pi_\theta. Discretise the expectation using K sample trajectories, each with period T, and use them to approximate J(\theta):
     \tilde{J}(\theta) = \frac{1}{KT} \sum_{i=1}^{K} \sum_{t=1}^{T} \left( Q^{(i)}_\theta(s^{(i)}_t, a^{(i)}_t) - \max_{a} \left[ r_t + \gamma Q^{(i)}_\theta(s^{(i)}_{t+1}, a) \right] \right)^2
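A sketch of the Monte Carlo estimate of \tilde{J}(\theta) over K sampled trajectories of period T; the linear q_theta stand-in and the trajectory format are assumptions for illustration, not the network used in the project.

```python
import numpy as np

def q_theta(theta, s):
    """Placeholder function approximator (assumption): a table/linear Q indexed by state."""
    return theta[s]                      # shape: (n_actions,)

def dqn_objective(theta, trajectories, gamma=0.99):
    """Empirical objective ~J(theta), averaged over K trajectories of (s, a, r, s') steps."""
    total, count = 0.0, 0
    for traj in trajectories:            # traj: list of (s_t, a_t, r_t, s_{t+1})
        for s, a, r, s_next in traj:
            target = np.max(r + gamma * q_theta(theta, s_next))   # max_a [r + gamma Q_theta(s', a)]
            total += (q_theta(theta, s)[a] - target) ** 2
            count += 1
    return total / count
```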

  5. Variational Inference. Main concepts:
     1. Try to solve an optimisation problem over a class of tractable distributions, q, parameterised by \phi, in order to find the one most similar to p.
     2. \min_{\phi} \mathrm{KL}\left( q_\phi(\theta) \,\|\, p(\theta \mid \mathcal{D}) \right)
     3. Approximate this using gradient descent.
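As a concrete illustration of where Edward comes in, this is roughly how the KL minimisation is expressed with Edward's KLqp interface. A minimal sketch assuming Edward 1.x on TensorFlow 1.x; the toy model, data, and iteration count are assumptions, not taken from the project.

```python
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

# Toy model (assumption): infer the mean theta of Gaussian observations.
theta = Normal(loc=0.0, scale=1.0)                       # prior p(theta)
x = Normal(loc=theta, scale=1.0, sample_shape=50)        # likelihood p(x | theta)

# Variational family q_phi(theta): a Gaussian with trainable parameters phi = (loc, scale).
qtheta = Normal(loc=tf.Variable(0.0),
                scale=tf.nn.softplus(tf.Variable(0.5)))

x_data = np.random.normal(3.0, 1.0, size=50).astype(np.float32)

# KLqp fits q_phi to the posterior by stochastic gradient descent on KL(q || p), via the ELBO.
inference = ed.KLqp({theta: qtheta}, data={x: x_data})
inference.run(n_iter=1000)
```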

  6. Variational Deep Q-Networks. Idea: for efficient exploration we need q_\phi(\theta) to be dispersed, with near-even coverage of the parameter space. Encourage this by adding an entropy bonus to the objective:
     \mathbb{E}_{\theta \sim q_\phi(\theta)} \left[ \left( Q_\theta(s_j, a_j) - \max_{a'} \mathbb{E}\left[ r_j + \gamma Q_\theta(s'_j, a') \right] \right)^2 \right] - \lambda H(q_\phi(\theta))
     Assigning systematic randomness to Q enables efficient exploration of the policy space. Further, encouraging high entropy over the parameter distribution prevents premature convergence. tl;dr: a higher chance of finding maximal rewards, in less time, than standard DQNs.
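A numpy sketch of how this entropy-regularised objective could be estimated for a diagonal Gaussian q_\phi(\theta); the parameterisation, sample count, and the bellman_sq_error callback are assumptions made here for illustration.

```python
import numpy as np

def gaussian_entropy(log_std):
    """Closed-form entropy of a diagonal Gaussian with the given log standard deviations."""
    return np.sum(0.5 * np.log(2.0 * np.pi * np.e) + log_std)

def vdqn_objective(mu, log_std, bellman_sq_error, lam=0.01, n_samples=8):
    """Monte Carlo estimate of E_{theta ~ q_phi}[squared Bellman error] - lambda * H(q_phi).

    bellman_sq_error(theta) is a caller-supplied function (assumed helper) returning the
    squared Bellman error of Q_theta on a minibatch, i.e. the slide's inner term.
    """
    errors = []
    for _ in range(n_samples):
        theta = mu + np.exp(log_std) * np.random.randn(*mu.shape)  # reparameterised sample
        errors.append(bellman_sq_error(theta))
    return np.mean(errors) - lam * gaussian_entropy(log_std)
```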

  7. Algorithm. (Figure: VDQN pseudocode.)
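The pseudocode figure itself is not reproduced in this transcript. The loop below is a rough Python sketch of how such an algorithm could be structured; the Gym-style environment interface, the q_phi_sample and update_phi helpers, and the hyperparameters are all assumptions, not the author's implementation.

```python
import numpy as np

def train_vdqn(env, q_phi_sample, update_phi, n_episodes=200, buffer_size=10000, batch_size=64):
    """Sketch of a VDQN-style training loop.

    q_phi_sample()    -> a sampled Q-function theta ~ q_phi(theta)          (assumed helper)
    update_phi(batch) -> one gradient step on the entropy-regularised loss  (assumed helper)
    env               -> a Gym-style environment with reset()/step()        (assumption)
    """
    replay = []
    for episode in range(n_episodes):
        q_theta = q_phi_sample()             # act greedily w.r.t. a freshly sampled Q (randomised exploration)
        s, done = env.reset(), False
        while not done:
            a = int(np.argmax(q_theta(s)))   # greedy action under the sampled Q
            s_next, r, done, _ = env.step(a)
            replay.append((s, a, r, s_next, done))
            replay = replay[-buffer_size:]   # keep a bounded replay buffer
            if len(replay) >= batch_size:
                idx = np.random.choice(len(replay), batch_size, replace=False)
                update_phi([replay[i] for i in idx])
            s = s_next
```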

  8. Aims / Goals / Workplan.

  9. Questions?
