Reinforcement Learning: Course 9.54 Final review
Agent learning to act in an unknown environment
Reinforcement Learning Setup
Background and setup
- The environment is initially unknown or partially known
- It is also stochastic: the agent cannot fully predict what will
happen next
- What is a ‘good’ action to select under these conditions?
- Animal learning seeks to maximize reward
Formal Setup
- The agent is in one of a set of states, {S1, S2,…Sn }
- At each state, it can take an action from a set of available
actions {A1, A2,…Ak}
- Taking action Aj in state Si leads to a new state Sk and possibly a
reward
[Figure: transition diagram from state S1 with actions A1, A2, A3 leading to states S1, S2, S3; rewards R = 2 and R = -1]
Stochastic transitions
The consequence of an action:
- (S,A) → (S', R)
- Governed by:
- P(S' | S, A)
- P(R | S, A, S')
- These probabilities are properties of the world. (‘Contingencies’)
- We assume the transitions are Markovian: the next state and reward depend only on the current state and action
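A minimal sketch of how the contingencies P(S' | S, A) and P(R | S, A, S') might be represented and sampled. The table `P`, its probabilities, and the helper `step` are hypothetical and chosen only for illustration; the reward here is fixed for each (S, A, S') triple.

```python
import random

# Hypothetical contingencies: P[state][action] is a list of
# (probability, next_state, reward) outcomes. The numbers are illustrative.
P = {
    "S1": {"A1": [(0.8, "S2", 2.0), (0.2, "S3", -1.0)]},
}

def step(state, action, rng):
    """Sample (S', R) for the pair (S, A) from the table above."""
    outcomes = P[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
samples = [step("S1", "A1", rng) for _ in range(1000)]
# Fraction of samples that landed in S2; should be close to 0.8.
frac_s2 = sum(1 for s, _ in samples if s == "S2") / len(samples)
print(frac_s2)
```

The sampled outcome depends only on the current (S, A) pair, which is the Markov assumption in code form.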
Policy
- The goal is to learn a policy π: S → A
- The policy determines the future of the agent:
[Figure: the policy π maps each of the states S1, S2, S3 to an action (A2, A1, A3)]
Model-based vs. Model-free RL
- Model-based methods assume that the probabilities
  – P(S' | S, A)
  – P(R | S, A, S')
  are known and can be used in planning
- In model-free methods:
  – The ‘contingencies’ are not known
  – They need to be learned by exploration as a part of policy learning
[Figure: trajectory S → S(1) → S(2) under actions a1, a2 with rewards r1, r2]
Step 1: defining Vπ(S)
- Start from S and just follow the policy π
- We find ourselves in state S(1) with reward r1, then in S(2) with reward r2, etc.
- Vπ(S) = < r1 + γ r2 + γ^2 r3 + … >
- The expected (discounted) reward from S on.
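As a small illustration of the sum inside the brackets, the sketch below computes the discounted return of one sampled reward sequence. The function name `discounted_return` and the example rewards are assumptions made for illustration; Vπ(S) is the average of this quantity over many trajectories that start in S and follow π.

```python
def discounted_return(rewards, gamma):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for one reward sequence."""
    total = 0.0
    # Work backwards so each step applies one extra factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# 1 + 0.5*0 + 0.25*2 = 1.5
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example)
```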
Step 2: equations for V(S)
- Vπ(S) = < r1 + γ r2 + γ^2 r3 + … >
- = < r1 + γ (r2 + γ r3 + … ) >
- = < r1 + γ Vπ(S') >
- These are equations relating Vπ(S) for different states.
- Next, write them explicitly in terms of the known parameters
(contingencies):
[Figure: from state S, action A leads to S1, S2 or S3 with rewards r1, r2, r3]
Equations for V(S)
- Vπ(S) = < r1 + γ Vπ(S') > = Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ], with a = π(S)
- E.g.:
Vπ(S) = 0.2 (r1 + γVπ(S1)) + 0.5 (r2 + γVπ(S2)) + 0.3 (r3 + γVπ(S3))
- Linear equations; the unknowns are Vπ(Si)
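The linear equations for Vπ can be solved by repeatedly applying the right-hand side as an update. The sketch below does this for a hypothetical four-state example that reuses the 0.2 / 0.5 / 0.3 transition probabilities from the slide; the rewards, the terminal states S1 and S2, the S3 → S transition, and γ = 0.9 are assumptions made for illustration.

```python
GAMMA = 0.9

# Transitions under a fixed policy: T[s] = [(prob, next_state, reward), ...].
# S1 and S2 are treated as terminal (no outgoing transitions), so V = 0 there.
T = {
    "S": [(0.2, "S1", 1.0), (0.5, "S2", 0.0), (0.3, "S3", 2.0)],
    "S1": [],
    "S2": [],
    "S3": [(1.0, "S", 0.0)],
}

def evaluate_policy(T, gamma, tol=1e-10):
    """Iterate V(S) <- sum_S' p(S'|S,a) [r + gamma V(S')] until it stops changing."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s, outcomes in T.items():
            new_v = sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = evaluate_policy(T, GAMMA)
print(V)
```

For this example the equations can also be solved by hand: V(S) = 0.8 + 0.243 V(S), so V(S) = 0.8 / 0.757, which the iteration approaches.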
Improving the Policy
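- Given the policy π, we can find the values Vπ(S) by solving the
linear equations iteratively
- Convergence is guaranteed for γ < 1 (the system of equations is
strictly diagonally dominant)
- Given Vπ(S), we can improve the policy by choosing in each state the action with the highest expected value:
π'(S) = argmax_a Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
- We can combine these steps to find the optimal policy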
[Figure: from state S, the policy action A and an alternative action A’ lead to S1, S2, S3 with rewards r1, r2, r3]
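A minimal sketch of one policy-improvement step: for a state S, compare the expected value of each available action under the current value estimates and pick the best one. The transition table `P`, the values in `V`, and γ = 0.9 are hypothetical numbers chosen only for illustration.

```python
GAMMA = 0.9

# Hypothetical contingencies for state S: P[s][a] = [(prob, next_state, reward), ...]
P = {
    "S": {
        "A": [(0.2, "S1", 1.0), (0.5, "S2", 0.0), (0.3, "S3", 2.0)],
        "A'": [(1.0, "S1", 1.5)],
    },
}
# Hypothetical values of the successor states under the current policy.
V = {"S1": 0.0, "S2": 0.0, "S3": 0.5}

def action_value(s, a):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def improve(s):
    """Greedy improvement: the action with the highest expected value in s."""
    return max(P[s], key=lambda a: action_value(s, a))

improved = {s: improve(s) for s in P}
print(improved)
```

Here action A is worth 0.2·1 + 0.3·(2 + 0.9·0.5) = 0.935 and A' is worth 1.5, so the improved policy switches to A' in S.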
Improving the policy
Value Iteration
Learning V and π together when the ‘contingencies’ are known:
Value Iteration Algorithm
* Value iteration is used in the problem set
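The algorithm box from the slide did not survive extraction. The sketch below shows the standard form of value iteration on a hypothetical two-state problem and is not necessarily the exact variant used in the course. The transition table `P`, the action names, and γ = 0.9 are assumptions for illustration.

```python
GAMMA = 0.9

# Hypothetical known contingencies: P[s][a] = [(prob, next_state, reward), ...]
P = {
    "S1": {
        "stay": [(1.0, "S1", 0.0)],
        "go": [(0.9, "S2", 0.0), (0.1, "S1", 0.0)],
    },
    "S2": {
        "stay": [(1.0, "S2", 1.0)],
        "go": [(1.0, "S1", 0.0)],
    },
}

def q_value(V, s, a):
    """Expected reward plus discounted next-state value for action a in s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-8):
    """Repeat V(S) <- max_a q_value(V, S, a) until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Read the greedy policy off the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy

V, policy = value_iteration()
print(V, policy)
```

In this example staying in S2 earns reward 1 at every step, so V(S2) approaches 1 / (1 - 0.9) = 10 and the greedy policy is to go from S1 and stay in S2.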
Q-learning
- The main algorithm used for model-free RL
Q-values (state-action)
[Figure: transition diagram from state S1 with actions A1, A2, A3; rewards R = 2 and R = -1; the values Q(S1, A1) and Q(S1, A3) are attached to the action branches]
Q-value (state-action)
- The same update is done on Q-values rather than on V
- Used in most practical algorithms and some brain
models
- Qπ(S,a) is the expected return starting from S, taking
the action a, and thereafter following policy π.
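The slides do not spell out the Q-learning update itself. The sketch below uses the standard textbook form, Q(S,a) ← Q(S,a) + α [ r + γ max_a' Q(S',a') − Q(S,a) ], as an assumption; the table `Q`, the learning rate α, and the numbers are hypothetical.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s_next, a')."""
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

# Hypothetical Q-table: Q[state][action].
Q = {
    "S1": {"A1": 0.0, "A3": 0.0},
    "S2": {"A1": 1.0, "A3": 3.0},
}
# Observed transition: in S1 take A1, receive r = 2, land in S2.
# Target = 2 + 0.9 * 3 = 4.7, so Q(S1, A1) moves halfway from 0 to 4.7.
updated = q_learning_update(Q, "S1", "A1", r=2.0, s_next="S2", alpha=0.5, gamma=0.9)
print(updated)
```

No transition probabilities appear in the update: the observed (S, a, r, S') sample stands in for the unknown contingencies, which is what makes the method model-free.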
SARSA
- It is called SARSA because it uses:
s(t), a(t), r(t+1), s(t+1), a(t+1)
- Each such step uses the current policy π, so the action in every
state is the one the policy selects: a = π(S)
SARSA RL Algorithm
* ε-greedy: with probability ε, do not select the greedy action; instead, select uniformly at random among all actions.
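The algorithm box from the slide is missing from this text. The sketch below is one common way to write SARSA with ε-greedy action selection, run on a hypothetical four-state chain; the environment `env_step`, the learning rate α, and all numeric settings are assumptions made for illustration.

```python
import random

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical chain of states 0, 1, 2, 3; state 3 is terminal

def env_step(state, action):
    """Move along the chain; reward 1 on reaching GOAL, otherwise 0."""
    nxt = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick uniformly among all actions, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state].values())
    # Break ties at random so early episodes still explore.
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])

def sarsa(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            s_next, r = env_step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Uses s(t), a(t), r(t+1), s(t+1), a(t+1); Q of the terminal state stays 0.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
greedy_policy = {s: max(Q[s], key=Q[s].get) for s in range(GOAL)}
print(greedy_policy)
```

Unlike Q-learning, the target uses the action a(t+1) that the current ε-greedy policy actually selects, so the learned values describe the policy being followed.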
TD learning biology: dopamine
Behavioral support for ‘prediction error’
Associating light cue with food
‘Blocking’
- No response to the bell
- The bell and food were consistently associated
- There was no prediction error.
- Conclusion: prediction error, not association, drives learning!
Learning and Dopamine
- Learning is driven by the prediction error:
δ(t) = r + γV(S') – V(S)
- Computed by the dopamine system
(Here too, if there is no error, no learning will take place)
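A minimal sketch of learning driven by the prediction error δ. A cue state is followed by a reward of 1, and V(cue) is updated by α·δ on every trial. The state names, α = 0.2, γ = 1, and the number of trials are assumptions made for illustration.

```python
def td_error(r, v_s, v_next, gamma):
    """Prediction error: delta = r + gamma * V(S') - V(S)."""
    return r + gamma * v_next - v_s

def td_update(V, s, r, s_next, alpha, gamma):
    """Move V(s) by alpha * delta and return delta."""
    delta = td_error(r, V[s], V[s_next], gamma)
    V[s] += alpha * delta
    return delta

# Hypothetical conditioning experiment: "cue" is always followed by reward 1.
V = {"cue": 0.0, "end": 0.0}
deltas = [td_update(V, "cue", r=1.0, s_next="end", alpha=0.2, gamma=1.0)
          for _ in range(200)]
print(deltas[0], deltas[-1], V["cue"])
```

The first trial produces the full error of 1. Once V(cue) predicts the reward, δ is close to zero and V stops changing, which matches the point that without a prediction error no learning takes place.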
Dopaminergic neurons
- Dopamine is a neuro-modulator
- Dopaminergic neurons are found in the:
  – VTA (ventral tegmental area)
  – Substantia nigra
- These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex.
Major players in RL
Effects of dopamine: why it is associated with reward and reward-related learning
- Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
- Neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation:
  – Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet
Dopamine and prediction error
- The animal (rat, monkey) gets a cue (visual, or auditory).
- A reward follows after a delay (1 sec in the figure below)