Breakout Group: Reinforcement Learning
Fabian Ruehle (University of Oxford)
String_Data 2017, Boston, 12/01/2017
Outline
- Theoretical introduction (30 minutes)
- Discussion of code (30 minutes)
- Solve a version of grid world with SARSA
- Discussion of RL and its applications to String Theory (30 minutes)
How to teach a machine
- Supervised Learning (SL):
  - provide a set of training tuples [(in0, out0), (in1, out1), ..., (inn, outn)]
  - after training, the machine predicts outi from ini
- Unsupervised Learning (UL):
  - only provide a training input set [in0, in1, ..., inn]
  - give the machine a task (e.g. cluster the input) without telling it exactly how to do this
  - after training, the machine will perform its self-learned action on ini
- Reinforcement Learning (RL):
  - in between SL and UL
  - machine acts autonomously, but actions are reinforced / punished (see the sketch below)
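A rough Python illustration (mine, not from the slides) of what the learner is given in each paradigm; the values are made up for illustration:

# Supervised Learning: training tuples (in_i, out_i); predict out from in.
sl_data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]  # e.g. learn out = 2 * in

# Unsupervised Learning: inputs only; the task (e.g. clustering) is given,
# but no target outputs are provided.
ul_data = [0.1, 0.2, 5.1, 5.3]  # two obvious clusters

# Reinforcement Learning: no fixed dataset; the agent generates experience
# tuples (state, action, reward, next_state) by acting in an environment.
rl_experience = [("s0", "right", -0.1, "s1"), ("s1", "down", 1.0, "exit")]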
Theoretical introduction
- Basic textbooks/literature
Reinforcement Learning - Vocabulary [Sutton, Barto '98, '17]
- The “thing that learns” is called the agent or worker
- The “thing that is explored” is called the environment
- The “elements of the environment” are called states or observations
- The “things that take you from one state to another” are called actions
- The “thing that tells you how to select the next action” is called the policy
- Actions are executed sequentially, in a sequence of (time) steps
- The “reinforcement” the agent experiences is called the reward
- The “accumulated reward” is called the return
- In RL, an agent performs actions in an environment with the goal of maximizing its long-term return
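To make the vocabulary concrete, here is a minimal Python sketch of the agent-environment loop; the environment interface (reset/step) and the policy function are illustrative assumptions, not the workshop's code:

def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: at each time step the agent observes a state,
    the policy selects an action, and the environment returns a reward and
    the next state. The accumulated reward is the return."""
    state = env.reset()            # initial state (observation)
    total_reward = 0.0             # accumulated reward (undiscounted return)
    for t in range(max_steps):
        action = policy(state)     # policy: state -> action
        state, reward, done = env.step(action)  # environment reacts
        total_reward += reward     # reinforcement signal
        if done:                   # e.g. the exit was reached
            break
    return total_reward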
Reinforcement Learning - Details
- We focus on discrete state and action spaces
- State space: S = {states in environment}
- Action space:
  - total: A = {actions to transition between states}
  - for s ∈ S: A(s) = {possible actions in state s}
- Policy π : S → A: selects the next action for a given state, π(s) = a with a ∈ A(s)
- Reward R : S × A → ℝ: reward R(s, a) ∈ ℝ for taking action a in state s
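A toy Python sketch of these ingredients; the states, allowed actions, and reward values are hypothetical, chosen only for illustration:

S = ["s0", "s1", "s2"]                  # state space S
A = {"s0": ["right"],                   # A(s): possible actions in state s
     "s1": ["left", "right"],
     "s2": ["left"]}

def pi(s):
    """A (deterministic) policy pi: S -> A, with pi(s) in A(s)."""
    return A[s][0]

def R(s, a):
    """Reward R(s, a) for taking action a in state s."""
    return 1.0 if (s, a) == ("s1", "right") else -0.1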
Reinforcement Learning - Details
- Return: the accumulated (discounted) reward from the current step t:
  Gt = Σ_{k=0}^∞ γ^k r_{t+k+1}, with discount factor γ ∈ (0, 1]
- State value function vπ(s): expected return for state s with policy π:
  vπ(s) = E[Gt | s = st]
- Action value function qπ(s, a): expected return for performing action a in state s with policy π:
  qπ(s, a) = E[Gt | s = st, a = at]
- Prediction problem: given π, predict vπ(s) or qπ(s, a)
- Control problem: find the optimal policy π* that maximizes vπ(s) or qπ(s, a)
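As a small worked example (assuming a finite list of rewards observed after step t), the discounted return can be computed as:

def discounted_return(rewards, gamma=0.9):
    """rewards = [r_{t+1}, r_{t+2}, ...]; gamma in (0, 1]."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: two mild step punishments followed by a big exit reward.
print(discounted_return([-0.1, -0.1, 10.0], gamma=0.9))  # -0.1 - 0.09 + 8.1 ≈ 7.91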
Reinforcement Learning - Details
- Commonly used policies:
  - greedy: choose the action that maximizes the action value function:
    π′(s) = argmax_a q(s, a)
  - ε-greedy: explore different possibilities:
    π′(s) = { greedy action in (1 − ε) of cases, random action in ε of cases }
- We take ε-greedy policy improvement (sketched in code below)
- On-policy: update the policy you are following (e.g. always ε-greedy)
- Off-policy: use a different policy for choosing the next action at+1 and for updating q(st, at)
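A minimal sketch of ε-greedy action selection over a tabular action value function; the dict-based q table and the argument names are assumptions for illustration, not the workshop's code:

import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit
    (the greedy action maximizing q(state, a))."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))  # greedy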
Reinforcement Learning - SARSA
- Solving the control problem: ∆v(st) = α[Gt − v(st)], ∆q(st, at) = α[Gt − q(st, at)]
- α: learning rate (α = 0 means no update to v(st))
- One-step approximation: Gt ≈ r + γ v(st+1)
- Similar for the action value function:
  ∆q(st, at) = α[r + γ q(st+1, at+1) − q(st, at)]
- The update depends on the tuple (st, at, r, st+1, at+1), hence the name SARSA (see the sketch below)
- at+1 is the currently best known action for state st+1
- Note: SARSA is on-policy
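A sketch of the tabular SARSA update, using the same dict-based q table as above (illustrative, not the workshop's implementation):

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """q(s,a) += alpha * [r + gamma * q(s', a') - q(s, a)].
    On-policy: a_next is the action the current policy actually chose."""
    target = r + gamma * q.get((s_next, a_next), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))

Running this update along trajectories generated by the ε-greedy policy itself is what makes SARSA on-policy.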
Reinforcement Learning - Q-Learning
- Very similar to SARSA
- Difference in the update:
  - SARSA: ∆q(st, at) = α[r + γ q(st+1, at+1) − q(st, at)]
  - Q-Learning: ∆q(st, at) = α[r + γ max_{a′} q(st+1, a′) − q(st, at)]
- Note: this means that Q-Learning is off-policy (see the sketch below)
- SARSA is often found to perform better
- Q-Learning is proven to converge to the solution
- Combine with deep NNs: Deep Q-Learning
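The corresponding Q-Learning update, again as a sketch over the same hypothetical tabular q; the only change from SARSA is the max over next actions:

def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """q(s,a) += alpha * [r + gamma * max_a' q(s', a') - q(s, a)].
    Off-policy: the update ignores which action the behavior policy picks next."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))

Because the max ignores the action the behavior policy actually takes next, the learned q corresponds to the greedy policy rather than the one generating the data.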
Example - Gridworld
[Figure: grid maze showing the worker (“explorer”), walls, the exit, and a pitfall]
Example - Gridworld
- We will look at a version of grid world:
  - Gridworld is a grid-like maze with walls, pitfalls, and an exit
  - Each state is a point on the grid of the maze
  - The actions are A = {up, down, left, right}
  - Goal: find the exit (strongly rewarded)
  - Each step is punished mildly (solve the maze quickly)
  - Pitfalls should be avoided (strongly punished)
  - Running into a wall does not change the state (see the sketch below)
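A minimal gridworld sketch along these lines; the layout, symbols, and reward values are illustrative assumptions, not the exact workshop code:

GRID = ["....#",
        ".#.P.",   # '#' wall, 'P' pitfall, 'E' exit, '.' empty square
        "...#E"]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; walls (and the outer boundary) leave the state
    unchanged. Returns (next_state, reward, done)."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "#":
        nr, nc = r, c                  # bumped into a wall: stay put
    cell = GRID[nr][nc]
    if cell == "E":
        return (nr, nc), 10.0, True    # exit: strongly rewarded
    if cell == "P":
        return (nr, nc), -10.0, True   # pitfall: strongly punished
    return (nr, nc), -0.1, False       # ordinary step: mildly punished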
Example - Gridworld
- Walls = boundaries of the landscape (negative number of branes)
- Empty square = consistent point in the landscape which does not correspond to our Universe
- Pitfalls = mathematically / physically inconsistent states (anomalies, tadpoles, ...)
- Exit = Standard Model of Particle Physics