  1. Breakout Group Reinforcement Learning
     Fabian Ruehle (University of Oxford)
     String_Data 2017, Boston, 12/01/2017

  2. Outline
     ‣ Theoretical introduction (30 minutes)
     ‣ Discussion of code (30 minutes)
       • Solve a version of grid world with SARSA
     ‣ Discussion of RL and its applications to String Theory (30 minutes)

  3. How to teach a machine
     ‣ Supervised Learning (SL):
       • provide a set of training tuples $[(\text{in}_0, \text{out}_0), (\text{in}_1, \text{out}_1), \dots, (\text{in}_n, \text{out}_n)]$
       • after training, the machine predicts $\text{out}_i$ from $\text{in}_i$
     ‣ Unsupervised Learning (UL):
       • only provide the training input set $[\text{in}_0, \text{in}_1, \dots, \text{in}_n]$
       • give a task to the machine (e.g. cluster the input) without telling it how to do this exactly
       • after training, the machine will perform the self-learned action on $\text{in}_i$
     ‣ Reinforcement Learning (RL):
       • in between SL and UL
       • the machine acts autonomously, but its actions are reinforced / punished

  4. Theoretical introduction

  5. Reinforcement Learning - Vocabulary
     ‣ Basic textbooks/literature: [Barto, Sutton ’98, ’17]
     ‣ The “thing that learns” is called agent or worker
     ‣ The “thing that is explored” is called environment
     ‣ The “elements of the environment” are called states or observations
     ‣ The “things that take you from one state to another” are called actions
     ‣ The “thing that tells you how to select the next action” is called policy
     ‣ Actions are executed sequentially, in a sequence of (time) steps
     ‣ The “reinforcement” the agent experiences is called reward
     ‣ The “accumulated reward” is called return
     ‣ In RL, an agent performs actions in an environment with the goal to maximize its long-term return

  6. Reinforcement Learning - Details
     ‣ We focus on discrete state and action spaces
     ‣ State space: $S = \{\text{states in environment}\}$
     ‣ Action space:
       • total: $A = \{\text{actions to transition between states}\}$
       • for $s \in S$: $A(s) = \{\text{possible actions in state } s\}$
     ‣ Policy $\pi: S \to A$: Select next action for given state, $\pi(s) = a$ with $a \in A(s)$
     ‣ Reward $R: S \times A \to \mathbb{R}$: Reward $R(s, a) \in \mathbb{R}$ for taking action $a$ in state $s$
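To make these definitions concrete, here is a minimal Python sketch (not the workshop code) of a state space, action space, reward, and policy for a small hypothetical 4x4 gridworld; the wall positions and reward values are illustrative assumptions.

```python
import random

# State space S: all cells of the grid that are not walls (assumed layout)
WALLS = {(1, 1), (2, 1)}
S = [(r, c) for r in range(4) for c in range(4) if (r, c) not in WALLS]

# Action space A: transitions between states
A = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def actions_in_state(s):
    """A(s): here every action is allowed in every state (walls simply block movement)."""
    return list(A)

def reward(s, a):
    """R(s, a): placeholder reward; the actual shaping is described on slide 12."""
    return -1.0  # mild punishment per step

def random_policy(s):
    """pi(s) = a with a in A(s): the simplest possible policy."""
    return random.choice(actions_in_state(s))
```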

  7. Reinforcement Learning - Details
     ‣ Return: The accumulated reward from current step $t$:
       $G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1}, \quad \gamma \in (0, 1]$
     ‣ State value function $v_\pi(s)$: Expected return for $s$ with policy $\pi$:
       $v_\pi(s) = \mathbb{E}[G_t \mid s = s_t]$
     ‣ Action value function $q_\pi(s, a)$: Expected return for performing action $a$ in state $s$ with policy $\pi$:
       $q_\pi(s, a) = \mathbb{E}[G_t \mid s = s_t, a = a_t]$
     ‣ Prediction problem: Given $\pi$, predict $v_\pi(s)$ or $q_\pi(s, a)$
     ‣ Control problem: Find the optimal policy $\pi^*$ that maximizes $v_\pi(s)$ or $q_\pi(s, a)$
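A small sketch of the discounted return for a finite episode, assuming an illustrative reward sequence and $\gamma = 0.9$ (both assumptions, not values from the talk). Averaging such returns over many episodes that start in a given state gives a simple Monte Carlo estimate of $v_\pi(s)$.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * r_{t+k+1}, with later rewards weighted by gamma^k."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: three mild step penalties followed by a large exit reward
print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # -1 - 0.9 - 0.81 + 7.29 = 4.58
```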

  8. Reinforcement Learning - Details
     ‣ Commonly used policies:
       • greedy: Choose the action that maximizes the action value function:
         $\pi'(s) = \operatorname{argmax}_a q(s, a)$
       • $\varepsilon$-greedy: Explore different possibilities:
         $\pi'(s) = \begin{cases} \text{greedy action} & \text{in } (1 - \varepsilon) \text{ of the cases} \\ \text{random action} & \text{in } \varepsilon \text{ of the cases} \end{cases}$
     ‣ We take $\varepsilon$-greedy policy improvement
     ‣ On-policy: Update the policy you are following (e.g. always $\varepsilon$-greedy)
     ‣ Off-policy: Use a different policy for choosing the next action $a_{t+1}$ and for updating $q(s_t, a_t)$
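A minimal sketch of the $\varepsilon$-greedy rule above, assuming the action value function is stored as a Python dict keyed by (state, action) pairs; this data structure and the default $\varepsilon = 0.1$ are assumptions, not the workshop's exact choices.

```python
import random

def epsilon_greedy(q, s, actions, epsilon=0.1):
    """Pick argmax_a q[(s, a)] with probability 1 - epsilon, a random action otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: q.get((s, a), 0.0))       # exploit (greedy)
```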

  9. Reinforcement Learning - SARSA
     ‣ Solving the control problem: $\Delta v(s_t) = \alpha\,[G_t - v(s_t)]$
       • $\alpha$: learning rate ($\alpha = 0$ means no update to $v(s_t)$)
       • one-step approximation: $G_t = r + \gamma\, v(s_{t+1})$
     ‣ Similar for the action value function:
       $\Delta q(s_t, a_t) = \alpha\,[G_t - q(s_t, a_t)] = \alpha\,[r + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
       • the update depends on the tuple $(s_t, a_t, r, s_{t+1}, a_{t+1})$
       • $a_{t+1}$ is the currently best known action for state $s_{t+1}$
     ‣ Note: SARSA is on-policy
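A sketch of one SARSA update under the same assumed dict representation of $q$; learning rate and discount defaults are illustrative.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Delta q(s_t, a_t) = alpha * [r + gamma * q(s_{t+1}, a_{t+1}) - q(s_t, a_t)].

    The target uses the action a_next actually selected by the behaviour policy
    (e.g. epsilon-greedy), which is what makes SARSA on-policy.
    """
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
```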

  10. Reinforcement Learning - Q-Learning
     ‣ Very similar to SARSA
     ‣ Difference in update:
       • SARSA: $\Delta q(s_t, a_t) = \alpha\,[r + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t)]$
       • Q-Learning: $\Delta q(s_t, a_t) = \alpha\,[r + \gamma \max_{a'} q(s_{t+1}, a') - q(s_t, a_t)]$
     ‣ Note: This means that Q-Learning is off-policy
     ‣ SARSA is often found to perform better in practice
     ‣ Q-Learning is proven to converge to the optimal solution
     ‣ Combine with deep NNs: Deep Q-Learning
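For comparison with the SARSA sketch above, here is the corresponding Q-Learning step under the same assumed dict representation of $q$; only the target changes.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-Learning step: the target uses max_a' q(s_{t+1}, a'), regardless of
    which action the behaviour policy actually takes next (hence off-policy)."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
```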

  11. Example - Gridworld
     [Figure: grid maze showing the worker (“Explorer”), pitfalls, walls, and the exit]

  12. Example - Gridworld
     ‣ We will look at a version of grid world:
       • Gridworld is a grid-like maze with walls, pitfalls, and an exit
       • Each state is a point on the grid of the maze
       • The actions are $A = \{\text{up, down, left, right}\}$
       • Goal: Find the exit (strongly rewarded)
       • Each step is punished mildly (solve the maze quickly)
       • Pitfalls should be avoided (strongly punished)
       • Running into a wall does not change the state
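A sketch of an environment step implementing the rules listed above; the grid layout, coordinate convention, and reward magnitudes are assumptions for illustration, not the workshop's exact code.

```python
EXIT, PITS, WALLS = (3, 3), {(2, 2)}, {(1, 1), (2, 1)}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action, size=4):
    """Return (next_state, reward, done) for one action in the maze."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    # Running into a wall or leaving the grid does not change the state
    if nxt in WALLS or not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
        nxt = state
    if nxt == EXIT:
        return nxt, 10.0, True    # finding the exit is strongly rewarded
    if nxt in PITS:
        return nxt, -10.0, True   # pitfalls are strongly punished
    return nxt, -1.0, False       # each step is mildly punished
```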

  13. Gridworld vs String Landscape
     ‣ Walls = Boundaries of the landscape (negative number of branes)
     ‣ Empty square = Consistent point in the landscape which does not correspond to our Universe
     ‣ Pitfalls = Mathematically / physically inconsistent states (anomalies, tadpoles, …)
     ‣ Exit = Standard Model of Particle Physics

  14. Coding

  15. Discussion
