SLIDE 1

Breakout Group Reinforcement Learning

FABIAN RUEHLE (UNIVERSITY OF OXFORD) String_Data 2017, Boston 12/01/2017

SLIDE 2
Outline

  • Theoretical introduction (30 minutes)
  • Discussion of code (30 minutes)
  • Solve a version of grid world with SARSA
  • Discussion of RL and its applications to String Theory (30 minutes)

SLIDE 3
How to teach a machine

  • Supervised Learning (SL):
  • provide a set of training tuples [(in_0, out_0), (in_1, out_1), …, (in_n, out_n)]
  • after training, the machine predicts out_i from in_i
  • Unsupervised Learning (UL):
  • only provide a training input set [in_0, in_1, …, in_n]
  • give the machine a task (e.g. cluster the input) without telling it exactly how to do this
  • after training, the machine will perform the self-learned action on in_i
  • Reinforcement Learning (RL):
  • in between SL and UL
  • machine acts autonomously, but actions are reinforced / punished

SLIDE 4

Theoretical introduction

SLIDE 5
Reinforcement Learning - Vocabulary

  • Basic textbooks/literature: [Sutton, Barto ’98 ’17]
  • The “thing that learns” is called agent or worker
  • The “thing that is explored” is called environment
  • The “elements of the environment” are called states or observations
  • The “things that take you from one state to another” are called actions
  • The “thing that tells you how to select the next action” is called policy
  • Actions are executed sequentially in a sequence called (time) steps
  • The “reinforcement” the agent experiences is called reward
  • The “accumulated reward” is called return
  • In RL, an agent performs actions in an environment with the goal to maximize its long-term return

SLIDE 6
Reinforcement Learning - Details

  • We focus on discrete state and action spaces
  • State space: S = {states in environment}
  • Action space: A = {actions to transition between states}
  • total: A
  • for s ∈ S: A(s) = {possible actions in state s}
  • Policy π: Select next action for given state: π : S → A with π(s) = a, a ∈ A(s)
  • Reward R: Reward for taking action a in state s: R : S × A → ℝ with R(s, a) ∈ ℝ
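To make these definitions concrete, here is a minimal Python sketch (an illustration, not the breakout's code; all names are assumptions) of a discrete state space, per-state action sets, a policy, and a reward function:

```python
# Minimal sketch of the objects defined above; all names are illustrative.
states = ["s0", "s1", "s2"]                 # S = {states in environment}

actions = {                                 # A(s) = {possible actions in state s}
    "s0": ["right"],
    "s1": ["left", "right"],
    "s2": ["left"],
}

def policy(s):
    """pi : S -> A(s); a trivial deterministic policy for illustration."""
    return actions[s][0]

def reward(s, a):
    """R : S x A -> R, the reward for taking action a in state s."""
    return 1.0 if (s, a) == ("s1", "right") else -0.1
```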

SLIDE 7
Reinforcement Learning - Details

  • Return: The accumulated reward from current step t:
    G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} ,  γ ∈ (0, 1]
  • State value function v_π(s): Expected return for s with policy π:
    v_π(s) = E[G_t | s = s_t]
  • Action value function q_π(s, a): Expected return for performing action a in state s with policy π:
    q_π(s, a) = E[G_t | s = s_t, a = a_t]
  • Prediction problem: Given π, predict v_π(s) or q_π(s, a)
  • Control problem: Find optimal policy π* that maximizes v_π(s) or q_π(s, a)
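As a quick illustration of the return formula (a sketch with made-up reward values, not from the slides), the discounted sum can be computed for a finite episode:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_{k>=0} gamma^k * r_{t+k+1} over a finite list of rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards observed after step t (illustrative values):
print(discounted_return([-0.1, -0.1, 1.0], gamma=0.9))  # -0.1 - 0.09 + 0.81 = 0.62
```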

SLIDE 8
Reinforcement Learning - Details

  • Commonly used policies:
  • greedy: Choose the action that maximizes the action value function:
    π′(s) = argmax_a q(s, a)
  • ε-greedy: Explore different possibilities:
    π′(s) = { choose greedy action in (1 − ε) of cases ; choose random action in ε of cases }
  • We take ε-greedy policy improvement
  • On-policy: Update the policy you are following (e.g. always ε-greedy)
  • Off-policy: Use a different policy for choosing the next action a_{t+1} and for updating q(s_t, a_t)
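An ε-greedy policy is straightforward to implement; the following is a hypothetical helper (not the session's code), selecting from a tabular q stored as a dict:

```python
import random

def epsilon_greedy(q, s, actions, eps=0.1):
    """With probability eps pick a random action; otherwise act greedily on q.
    q is a dict mapping (state, action) -> value (an assumed layout)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((s, a), 0.0))
```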

SLIDE 9
Reinforcement Learning - SARSA

  • Solving the control problem:
    Δv(s_t) = α[G_t − v(s_t)] ,  Δq(s_t, a_t) = α[G_t − q(s_t, a_t)]
  • α: Learning rate (α = 0 means no update to v(s_t))
  • One step approximation: G_t ≈ r + γ v(s_{t+1})
  • Similar for action value function:
    Δq(s_t, a_t) = α[r + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)]
  • Update depends on tuple (s_t, a_t, r, s_{t+1}, a_{t+1}): State-Action-Reward-State-Action, hence SARSA
  • a_{t+1} is the currently best known action for state s_{t+1}
  • Note: SARSA is on-policy
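The SARSA update translates directly into code; this sketch uses the same assumed tabular q layout as the ε-greedy helper above:

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA step for the tuple (s, a, r, s', a')."""
    old = q.get((s, a), 0.0)
    target = r + gamma * q.get((s_next, a_next), 0.0)   # G_t ~ r + gamma * q(s', a')
    q[(s, a)] = old + alpha * (target - old)            # dq = alpha * (G_t - q(s, a))
```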

SLIDE 10
Reinforcement Learning - Q-Learning

  • Very similar to SARSA
  • Difference in update:
  • SARSA: Δq(s_t, a_t) = α[r + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)]
  • Q-Learning: Δq(s_t, a_t) = α[r + γ max_{a′} q(s_{t+1}, a′) − q(s_t, a_t)]
  • Note: This means that Q-Learning is off-policy
  • SARSA is often found to perform better in practice
  • Q-Learning is proven to converge to the optimal solution
  • Combine with deep NNs: Deep Q-Learning
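For comparison, a sketch of the Q-Learning update: the only change from the SARSA sketch above is that it bootstraps from the best next action rather than the action actually taken (the off-policy step):

```python
def q_learning_update(q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """Apply one Q-Learning step; bootstraps from the max over next actions."""
    old = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions_next)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```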

SLIDE 11

Example - Gridworld

[Figure: gridworld maze showing the Worker (“Explorer”), Walls, the Exit, and a Pitfall]

SLIDE 12
Example - Gridworld

  • We will look at a version of grid world:
  • Gridworld is a grid-like maze with walls, pitfalls, and an exit
  • Each state is a point on the grid of the maze
  • The actions are A = {up, down, left, right}
  • Goal: Find the exit (strongly rewarded)
  • Each step is punished mildly (solve the maze quickly)
  • Pitfalls should be avoided (strongly punished)
  • Running into a wall does not change the state
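A tiny environment with exactly these rules could look as follows; this is an illustrative sketch with a made-up maze layout, not the code distributed in the breakout:

```python
# '#' = wall, 'P' = pitfall, 'E' = exit, '.' = empty square (layout is made up).
GRID = ["####",
        "#.E#",
        "#P.#",
        "####"]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, reward, done) for one action in the maze."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if GRID[nr][nc] == "#":           # running into a wall: state unchanged
        nr, nc = r, c
    cell = GRID[nr][nc]
    if cell == "E":
        return (nr, nc), 1.0, True    # exit: strongly rewarded
    if cell == "P":
        return (nr, nc), -1.0, True   # pitfall: strongly punished
    return (nr, nc), -0.04, False     # every step mildly punished
```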

SLIDE 13
Gridworld vs String Landscape

  • Walls = Boundaries of the landscape (negative number of branes)
  • Empty square = Consistent point in the landscape which does not correspond to our Universe
  • Pitfalls = Mathematically / physically inconsistent states (anomalies, tadpoles, …)
  • Exit = Standard Model of Particle Physics

SLIDE 14

Coding
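Putting the sketches above together, a minimal SARSA training loop for the gridworld might look like the following; it assumes the hypothetical step, epsilon_greedy, and sarsa_update helpers from the earlier sketches are in scope, and is an outline of the exercise rather than the session's actual code:

```python
ACTIONS = list(MOVES)        # ["up", "down", "left", "right"]
q = {}                       # tabular q(s, a); missing entries default to 0.0

for episode in range(500):
    s = (1, 1)                                   # start on an empty square
    a = epsilon_greedy(q, s, ACTIONS, eps=0.1)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(q, s_next, ACTIONS, eps=0.1)
        sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9)
        s, a = s_next, a_next
```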

SLIDE 15

Discussion