Reinforcement Learning Course 9.54 Final review

Agent learning to act in an unknown environment

Reinforcement Learning Setup
[Figure: the reinforcement learning setup, with the state labeled S_t]

Background and setup
• The environment is initially unknown or only partially known
• It is also stochastic: the agent cannot fully predict what will happen next
• What is a ‘good’ action to select under these conditions?
• Animal learning seeks to maximize reward

Formal Setup
• The agent is in one of a set of states {S_1, S_2, …, S_n}
• At each state, it can take an action from a set of available actions {A_1, A_2, …, A_k}
• From state S_i, taking action A_j leads to a new state S' and a possible reward R

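Stochastic transitions
[Figure: transition diagram in which actions A_1, A_2, A_3 taken from S_1 lead stochastically to states S_1, S_2, S_3; example rewards R = 2 and R = -1]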

The consequence of an action:
• (S, A) → (S', R)
• Governed by:
  – P(S' | S, A)
  – P(R | S, A, S')
• These probabilities are properties of the world (the ‘contingencies’)
• The transitions are assumed to be Markovian: the next state and reward depend only on the current state and action
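A minimal sketch of how one step (S, A) → (S', R) could be sampled from such contingencies. The table `P`, its numbers, and the function name `step` are hypothetical and chosen only for illustration; for simplicity the reward is assumed to be fixed once (S, A, S') is known, which collapses P(R | S, A, S') to a single value.

```python
import random

# Hypothetical contingency table (illustrative numbers only):
# P[(state, action)] is a list of (next_state, reward, probability).
P = {
    ("S1", "A1"): [("S2", 0.0, 0.7), ("S3", 2.0, 0.3)],
    ("S1", "A3"): [("S3", 0.0, 0.5), ("S1", -1.0, 0.5)],
}

def step(state, action, rng=random):
    """Sample (next_state, reward) according to P(S' | S, A)."""
    outcomes = P[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

if __name__ == "__main__":
    rng = random.Random(0)
    print([step("S1", "A1", rng) for _ in range(5)])
```

Because the outcome depends only on the current (S, A) pair, the sketch is Markovian by construction.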

Policy
• The goal is to learn a policy π: S → A
• The policy determines the future of the agent
[Figure: π assigns an action to each state; labels S_1, S_2, S_3 and A_1, A_2, A_3]

Model-based vs. Model-free RL
• Model-based methods assume that the probabilities
  – P(S' | S, A)
  – P(R | S, A, S')
  are known and can be used in planning
• In model-free methods:
  – The ‘contingencies’ are not known
  – They need to be learned by exploration as part of policy learning

Step 1: defining V^π(S)
[Figure: trajectory S → S(1) → S(2) under actions a_1, a_2 with rewards r_1, r_2]
• Start from S and just follow the policy π
• We arrive in state S(1) and receive reward r_1, and so on
• V^π(S) = < r_1 + γ r_2 + γ^2 r_3 + … >
• This is the expected (discounted) reward from S onward; < · > denotes the expectation
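The definition can be illustrated with a short sketch: compute the discounted sum for one sampled reward sequence, then average over several sequences to approximate the expectation. The function names and the example reward lists are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for one reward sequence."""
    total = 0.0
    # Work backwards: total = r_t + gamma * (return from t+1 onward).
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def estimate_value(reward_sequences, gamma):
    """Average the discounted returns of several sampled trajectories."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)

if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(estimate_value([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5))
```

The backward loop uses the same regrouping, r_1 + γ (r_2 + γ r_3 + …), that turns the definition into a recursive equation.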

Step 2: equations for V(S)
• V^π(S) = < r_1 + γ r_2 + γ^2 r_3 + … >
• = < r_1 + γ (r_2 + γ r_3 + … ) >
• = < r_1 + γ V^π(S') >
• These are equations relating V^π(S) for different states.
• Next, write them explicitly in terms of the known parameters (contingencies):

Equations for V(S)
[Figure: from S, action A leads to S_1, S_2, S_3 with rewards r_1, r_2, r_3]
• V^π(S) = < r_1 + γ V^π(S') > = Σ_{S'} p(S' | S, a) [ r(S, a, S') + γ V^π(S') ], with a = π(S)
• E.g.: V^π(S) = 0.2 (r_1 + γ V^π(S_1)) + 0.5 (r_2 + γ V^π(S_2)) + 0.3 (r_3 + γ V^π(S_3))
• Linear equations; the unknowns are V^π(S_i)
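To make the linearity explicit, the same equations can be written in matrix form. The matrix and vector symbols below are notation introduced here for illustration and are not from the slides: P^π_{ij} = p(S_j | S_i, π(S_i)) and r_{ij} = r(S_i, π(S_i), S_j).

```latex
\begin{aligned}
V^{\pi}(S_i) &= \sum_{j} P^{\pi}_{ij}\,\bigl[\, r_{ij} + \gamma\, V^{\pi}(S_j) \,\bigr], \\
\mathbf{V}^{\pi} &= \mathbf{R}^{\pi} + \gamma\, \mathbf{P}^{\pi}\, \mathbf{V}^{\pi},
\qquad R^{\pi}_i = \sum_{j} P^{\pi}_{ij}\, r_{ij}, \\
(\mathbf{I} - \gamma\, \mathbf{P}^{\pi})\, \mathbf{V}^{\pi} &= \mathbf{R}^{\pi}.
\end{aligned}
```

This is a system of n linear equations in the n unknowns V^π(S_i), which has a unique solution when γ < 1.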

Improving the Policy
• Given the policy π, we can find the values V^π(S) by solving the linear equations iteratively
• Convergence is guaranteed for γ < 1 (the system of equations is strictly diagonally dominant)
• Given V^π(S), we can improve the policy by choosing in each state the action with the highest expected r + γ V^π(S')
• We can combine these steps to find the optimal policy
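The two steps, iterative evaluation and greedy improvement, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the course's implementation: the two-state model `P`, the action names, and `GAMMA` are all hypothetical.

```python
GAMMA = 0.9

# Hypothetical model: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def backup(V, s, a):
    """Expected r + gamma * V(S') when taking action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate_policy(policy, tol=1e-10):
    """Solve the linear equations for V^pi by repeated sweeps."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = backup(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def improve_policy(V):
    """Pick, in each state, the action with the highest backed-up value."""
    return {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}

def policy_iteration(policy):
    """Alternate evaluation and improvement until the policy is stable."""
    while True:
        V = evaluate_policy(policy)
        new_policy = improve_policy(V)
        if new_policy == policy:
            return policy, V
        policy = new_policy

if __name__ == "__main__":
    print(policy_iteration({0: "stay", 1: "go"}))
```

Starting from a poor policy, the loop stops once improvement no longer changes any action.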

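Improving the policy
[Figure: in state S, the policy action A = π(S) compared with an alternative action A’; successor states S_1, S_2, S_3 with rewards r_1, r_2, r_3]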

Value Iteration: learning V and π when the ‘contingencies’ are known

Value Iteration Algorithm * Value iteration is used in the problem set
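The algorithm itself did not survive on this slide, so the following is a sketch of the standard value iteration update, V(S) ← max_a Σ_{S'} p(S' | S, a) [ r(S, a, S') + γ V(S') ], and not necessarily the exact version used in the problem set. The three-state model `P` and `GAMMA` are hypothetical.

```python
GAMMA = 0.9

# Hypothetical model: P[state][action] = list of (probability, next_state, reward).
# S3 is absorbing with zero reward.
P = {
    "S1": {"A1": [(0.5, "S2", 0.0), (0.5, "S3", 2.0)], "A2": [(1.0, "S1", -1.0)]},
    "S2": {"A1": [(1.0, "S3", 1.0)]},
    "S3": {"A1": [(1.0, "S3", 0.0)]},
}

def action_value(V, s, a):
    """Expected r + gamma * V(S') for action a in state s."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-10):
    """Repeat the max-backup until the values stop changing, then read off pi."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(action_value(V, s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: action_value(V, s, a)) for s in P}
    return V, policy

if __name__ == "__main__":
    print(value_iteration())
```

Unlike the evaluate-then-improve loop, each sweep here folds the improvement step into the value update through the max over actions.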

Q-learning • The main algorithm used for model-free RL

Q-values (state-action)
[Figure: the transition diagram from S_1, with the values Q(S_1, A_1) and Q(S_1, A_3) attached to actions A_1 and A_3; rewards R = 2 and R = -1]
Q^π(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π

Q-value (state-action)
• The same update is done on Q-values rather than on V
• Used in most practical algorithms and some brain models
• Q^π(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π:
  Q^π(S, a) = Σ_{S'} p(S' | S, a) [ r(S, a, S') + γ V^π(S') ]
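A minimal sketch of tabular Q-learning, using the standard update Q(S, a) ← Q(S, a) + α [ r + γ max_a' Q(S', a') – Q(S, a) ]. The update rule is the textbook one and is not quoted from the slides; the corridor environment, the learning rate α, and the uniformly random behaviour policy are assumptions made to keep the example small.

```python
import random

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical corridor: states 0..3, reward 1 on reaching state 3

def env_step(state, action):
    """Deterministic toy dynamics; the agent never sees this function's internals."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice(ACTIONS)  # exploratory behaviour policy (assumption)
            s2, r = env_step(s, a)
            # Off-policy target: best action in the next state.
            # Q at the terminal state is never updated and stays 0.
            best_next = max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

if __name__ == "__main__":
    Q = q_learning()
    for s in range(GOAL):
        print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

The contingencies are only ever sampled through `env_step`, which is what makes the method model-free.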

SARSA
• It is called SARSA because it uses: s(t), a(t), r(t+1), s(t+1), a(t+1)
• Each step uses the current policy π, so the action in every state S is chosen according to the policy: a = π(S)

SARSA RL Algorithm
* ε-greedy: with probability ε, do not select the greedy action; instead, select with equal probability among all actions.
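The algorithm box is missing from this slide, so here is a sketch of the standard SARSA update, Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') – Q(s, a) ], with ε-greedy action selection as described above. The corridor environment, α, and random tie-breaking among equally good actions are assumptions for illustration.

```python
import random

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical corridor: states 0..3, reward 1 on reaching state 3

def env_step(state, action):
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick uniformly among all actions, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties at random so early episodes are not biased (assumption).
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def sarsa(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            s2, r = env_step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            # On-policy target: uses the action a2 that will actually be taken.
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

if __name__ == "__main__":
    Q = sarsa()
    for s in range(GOAL):
        print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

Each update consumes exactly the tuple s(t), a(t), r(t+1), s(t+1), a(t+1), which is where the name comes from.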

TD learning Biology: dopamine

Behavioral support for ‘prediction error’ Associating light cue with food

‘Blocking’
• No response to the bell
• The bell and food were consistently associated
• There was no prediction error
• Conclusion: prediction error, not association, drives learning!

Learning and Dopamine
• Learning is driven by the prediction error: δ(t) = r + γV(S') – V(S)
• Computed by the dopamine system (here too, if there is no error, no learning will take place)
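A small sketch of how the prediction error shrinks as the value is learned. It assumes the standard TD(0) update V(S) ← V(S) + α δ, where the learning rate α is a parameter not given on the slide, and a hypothetical trial structure: a single cue state followed by a reward and a terminal state with V = 0.

```python
def td_error(r, v_next, v_current, gamma):
    """Prediction error: delta = r + gamma * V(S') - V(S)."""
    return r + gamma * v_next - v_current

def learn_cue_value(trials=50, alpha=0.2, gamma=1.0, reward=1.0):
    """Repeated cue -> reward trials; returns the learned V(cue) and all deltas."""
    v_cue = 0.0
    errors = []
    for _ in range(trials):
        delta = td_error(reward, 0.0, v_cue, gamma)  # terminal state has V = 0
        v_cue += alpha * delta                        # no error, no learning
        errors.append(delta)
    return v_cue, errors

if __name__ == "__main__":
    v, errors = learn_cue_value()
    print(round(v, 4), round(errors[0], 4), round(errors[-1], 6))
```

Once the cue fully predicts the reward, δ is close to zero and the value stops changing, which matches the blocking result: a stimulus added at that point has no error left to learn from.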

Dopaminergic neurons
• Dopamine is a neuromodulator
• Dopaminergic neurons are located in the:
  – VTA (ventral tegmental area)
  – Substantia nigra
• These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example the striatum, nucleus accumbens, and frontal cortex.

Major players in RL

Effects of dopamine: why it is associated with reward and reward-related learning
• Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
• Neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation
• Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet

Dopamine and prediction error
• The animal (rat, monkey) gets a cue (visual or auditory).
• A reward follows after a delay (1 sec in the example shown)

Dopamine and prediction error
