SLIDE 1
ADVANCED MACHINE LEARNING Brief Overview of Discrete and Continuous Reinforcement Learning (not part of exam material)

SLIDE 2

Forms of Learning

  • Supervised learning – where the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.
  • Reinforcement learning – where the algorithm learns a policy or model of the set of transitions across a discrete set of input-output states (Markovian world) in order to maximize a reward value (external reinforcement).
  • Unsupervised learning – where the algorithm learns a model that best represents a set of inputs without any feedback (no desired output, no external reinforcement).

SLIDE 3

Learning how to stand up

Morimoto and Doya, Robotics and Autonomous Systems , 2001

Example of RL

SLIDE 4

Reinforcement learning: Sequential Decision Problem

Problem: search for a mapping from states to actions, $f: s_t \mapsto a_t$.

Task: get rock samples. Feedback: success or failure. It is up to the robot to figure out the best solution!

What are the rewards?

Exploration: the robot has to try and explore multiple solutions to find the best. Let's try everything!

SLIDE 5

Supervised / semi-supervised learning

Problem: search for a mapping from states to actions, $f: s_t \mapsto a_t$.

In supervised learning, at each time step the expert provides pairs:

$(s_1, a_1), (s_2, a_2), (s_3, a_3), \dots, (s_T, a_T)$

In semi-supervised learning, only partial supervision is provided, e.g.:

$(s_1, ?), (s_2, a_2), (s_3, a_3), (s_4, a_4), \dots, (s_T, ?)$

The set of state-action pairs provided for training is optimal (expert teacher).

SLIDE 6

Learning how to swing up a pendulum

Atkeson & Schaal, Intern. Conf. on Machine Learning, ICML 1997; Schaal, NIPS, 1997

Example of RL bootstrapped with supervised learning

SLIDE 7

Reinforcement learning & supervised learning

The expert provides some examples of optimal state-action pairs and of the associated reward:

$(s_1, a_1, r_1), (s_2, a_2, r_2), (s_3, a_3, r_3), (s_4, a_4, r_4), \dots, (s_T, a_T, r_T)$

The agent searches for the solution by generating new roll-outs, i.e. action-state pairs in a neighbourhood around the expert's demonstrations. These solutions are not necessarily optimal. The expert provides a reward for these roll-outs:

$(s_1, a_1, r_1), (s_2, a_2, r_2), (s_3, a_3, r_3), \dots, (s_T, a_T, r_T)$

SLIDE 8

The Reward

The reward shapes the learning. Choosing it well is crucial to the success of learning. Imagine that you want to train a robot to learn to walk.

  • What reward would you choose for training a robot to stand-up?
  • What is the dimension of the state-action space?
  • How long would it take to learn through trial and error?
  • How is the reward helping reduce this number?

UC Berkeley, Darwin Robot

SLIDE 9

The Reward

One could choose a more complex (more informative) reward: Reward = penalty for deviation of the center of mass from the equilibrium point + reward for cyclic motion of the left and right legs + reward for in-phase motion of the upper and lower leg, etc. This reduces the search over the state-action space by looking for phase relationships between the joints.

[Figure: unconstrained search over joint angles vs. constrained search for torso motion and relative leg motion]

SLIDE 10

RL: Optimality

Reinforcement learning with discrete state-action spaces and a finite horizon can be solved in an optimal manner. We will see next how this can be done. This is no longer true for generic continuous state-action spaces. However, the same principles can be extended to continuous worlds, albeit with a loss of the optimality guarantee. (Note that you can also guarantee optimality in continuous state and action spaces, but some assumptions have to be made, e.g. Gaussian noise and a linear control policy.)

SLIDE 11

RL: Discrete State

[Grid-world figure: agent, fire pit, goal, reward]

Set of possible states in the world (environment + agent): 225 states in this example (not all shown).

SLIDE 12

RL: Discrete State

[Grid-world figure: agent, fire pit, goal]

A set of possible actions of the agent.

A policy $\pi(s, a)$ is used to choose an action $a$ from any state $s$. RL learns an optimal policy.

SLIDE 13

RL: Discrete Actions

Illustration of a policy $\pi(s, a)$.

[Grid-world figure: agent, fire pit]

Stochastic environment: transitions across states are not deterministic, $p(s_{t+1} \mid s_t, a_t)$. Rewards may also be stochastic.

Knowing $p$ requires a model of the world. It can be learned while learning the policy.

SLIDE 14

RL: the effect of the environment

[Figure panels: deterministic vs. stochastic environment]

RL takes into account the stochasticity of the environment.

SLIDE 15

RL: the effect of the environment

RL assumes that the world is first-order Markov:

$p\!\left(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots \right) = p\!\left(s_{t+1} \mid s_t, a_t\right)$

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

In words: the probability of a transition to a new state (and new reward) depends only on the current state and action, not on the history of previous states and actions. If the state and action sets are finite, it is a finite MDP. This assumption drastically reduces computation: there is no need to propagate probabilities over the whole history.

SLIDE 16

RL: the policy

Example of a greedy policy ending up in a limit cycle when the policy is poor. The agent must be able to measure how well it is doing and use this measure to update its policy. A good policy maximizes the expected reward.

Greedy policy: at each time step, the agent, being in state $s_t$, chooses an action $a_t$ by drawing from $\pi(s, a)$:

  • If $\pi(s, a)$ is equiprobable for all actions $a$, pick an action at random.
  • Otherwise pick the best action, $a = \arg\max_{a} \pi(s, a)$.
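As an illustration, a minimal Python sketch of this greedy action-selection rule for a tabular policy; the states, actions and probabilities below are made up for the example:

    import random

    # Hypothetical tabular policy: pi[state][action] = probability of choosing `action` in `state`.
    pi = {
        "s0": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},  # equiprobable
        "s1": {"up": 0.70, "down": 0.10, "left": 0.10, "right": 0.10},
    }

    def greedy_action(pi, state):
        """Pick argmax_a pi(s, a); break ties (e.g. the equiprobable case) at random."""
        probs = pi[state]
        best = max(probs.values())
        best_actions = [a for a, p in probs.items() if p == best]
        return random.choice(best_actions)

    print(greedy_action(pi, "s0"))  # any of the four actions
    print(greedy_action(pi, "s1"))  # "up"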

SLIDE 17

RL: exercise I

SLIDE 18

RL: Value function

The state-value function gives, for each state, an estimate of the expected reward starting from that state:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 1} r_{t+k} \;\middle|\; s_t = s \right]$

It depends on the agent's policy.

[Figure: reward and value function over the grid world]

SLIDE 19

RL: Value function

[Figure: a policy and the corresponding value function (greedy policy)]

SLIDE 20

RL: Value function

Discount future rewards:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right], \qquad 0 \le \gamma \le 1$

$\gamma \rightarrow 0$: shortsighted; $\gamma \rightarrow 1$: farsighted.
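To make the effect of the discount factor concrete, a small Python sketch computing the discounted return of a finite, made-up reward sequence:

    def discounted_return(rewards, gamma):
        """Compute sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    rewards = [0.0, 0.0, 1.0, 0.0, 10.0]          # hypothetical rewards along one episode
    print(discounted_return(rewards, gamma=0.0))   # shortsighted: only the first reward counts
    print(discounted_return(rewards, gamma=0.9))   # farsighted: later rewards still matter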

SLIDE 21

RL: Markov Decision Process (MDP)

[Grid-world figure: agent, fire pit, goal]

How to find the best possible policy? In an MDP, find the optimal value function: it gives the optimal policy.

SLIDE 22

RL: How to find an optimal policy ?

Find the value function.

The policy is needed to compute the expectation.

Exploit the recursive property: the Bellman equation.

SLIDE 23

RL: Bellman Equation

The return is recursive:

$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \gamma^{3} r_{t+4} + \dots = r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^{2} r_{t+4} + \dots \right) = r_{t+1} + \gamma R_{t+1}$

The Bellman equation is a recursive equation describing MDPs. So:

$V^{\pi}(s) = E_{\pi}\!\left[ R_t \mid s_t = s \right] = E_{\pi}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]$

Or, written out without the expectation operator (assuming an MDP):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$   (Bellman Equation)

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 24

RL: Bellman Policy Evaluation

So:

$V^{\pi}(s) = E_{\pi}\!\left[ R_t \mid s_t = s \right] = E_{\pi}\!\left[ r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s \right]$

Or, without the expectation operator (assuming an MDP):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$   (Bellman Equation)

[Backup diagram: from state $s$ with value $V^{\pi}(s)$, take action $a$ with probability $\pi(s, a)$, transition to $s'$ with probability $P_{ss'}^{a}$, receive reward $R_{ss'}^{a}$, reach value $V^{\pi}(s')$.]

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
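A minimal Python sketch of iterative policy evaluation based on the Bellman equation above, on a tiny hypothetical MDP (two states, two actions; all transition probabilities and rewards are invented for the example):

    # Hypothetical 2-state MDP: P[s][a] = list of (probability, next_state, reward).
    P = {
        "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
        "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "A", 0.0)]},
    }
    pi = {"A": {"stay": 0.5, "go": 0.5}, "B": {"stay": 0.5, "go": 0.5}}  # fixed policy to evaluate
    gamma = 0.9

    V = {s: 0.0 for s in P}
    for sweep in range(100):                     # repeated sweeps over all states
        for s in P:
            # Bellman backup: V(s) = sum_a pi(s,a) sum_s' P [R + gamma V(s')]
            V[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in P[s])
    print(V)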

SLIDE 25

Model-based reinforcement learning

To estimate $V$, one needs a model of the world to estimate the state transitions $P_{ss'}^{a}$ and the reward distribution $R_{ss'}^{a}$ (if stochastic):

$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma\, V^{\pi}(s') \right]$

[Backup diagram: state $s$, action $a$, reward $r$, next state $s'$.]

If the model is not known, one resorts to sample-based techniques.

SLIDE 26

RL: How to find optimal policy ? (Again!)

Find the value function: it is obtained by solving the 36 equations (one Bellman equation per state) with 36 unknowns.

SLIDE 27

RL: How to do Control ?

The state-value function gives for each state an estimate of the expected reward starting from that state; it depends on the agent's policy:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]$

The action-value function is a measure of the expected reward when taking action $a$ in state $s$ under policy $\pi$:

$Q^{\pi}(s, a) = E_{\pi}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]$

The action-value function and the state-value function are directly related:

$V^{\pi}(s) = \sum_{a \in A} \pi(s, a)\, Q^{\pi}(s, a)$

See handout on website RQV.pdf

SLIDE 28

RL: V-Q-value functions

V(s) tells you how good a state is, Q(s,a) tells you how good an action from a given state is.

SLIDE 29

RL: Dynamic Programming

Policy evaluation and improvement (Generalized Policy Iteration):
  • 1. Evaluate the policy: policy evaluation is linear.
  • 2. Improve the policy: the policy becomes greedy.

Value Iteration:
  • 1. Evaluate the policy and improve it in one step: policy evaluation is non-linear.

We need to know the models!

SLIDE 30

Policy Evaluation & Improvement

Generalized Policy Iteration

Sutton & Barto Chapter 4
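As a complement, a minimal Python sketch of the policy-improvement step: make the policy greedy with respect to the current value function. The small model and the value estimates are hypothetical (e.g. coming from a previous policy-evaluation sweep such as the one sketched earlier):

    gamma = 0.9
    # Hypothetical model: P[s][a] = list of (probability, next_state, reward).
    P = {
        "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
        "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "A", 0.0)]},
    }
    V = {"A": 2.0, "B": 1.0}   # assumed to come from a previous policy-evaluation step

    def greedy_policy(P, V, gamma):
        """Return the deterministic policy that is greedy with respect to V."""
        policy = {}
        for s, actions in P.items():
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                 for a, trans in actions.items()}
            policy[s] = max(q, key=q.get)        # improvement: pick the action with the best backup
        return policy

    print(greedy_policy(P, V, gamma))            # {'A': 'go', 'B': 'go'} for these numbers

Alternating this improvement step with policy evaluation is one instance of generalized policy iteration.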

SLIDE 31

Methods for estimating the value functions

Three main methods:

  • Dynamic Programming (DP): sweeps through all states and uses the Bellman equation to estimate V(s) at each step. Works only if the model of the environment is known and the number of states is small enough.
  • Monte-Carlo (MC): approximates the true value function from sampled returns. Can be used on-line. No model of the world is necessary.
  • Temporal-Difference Learning (TD): like MC, TD methods learn directly from raw experience; like DP, TD methods bootstrap, i.e. they update estimates without waiting for a final outcome.

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 32

Monte-Carlo Sampling

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Simply follow the policy during many episodes and compute the average of the returns obtained for each visited state:

$R_{e}(s) = \sum_{k} \gamma^{k} r_{k}$   (return gathered after visiting $s$ in episode $e$)

$V(s) \approx \frac{R_{e_1}(s) + R_{e_2}(s) + R_{e_3}(s)}{3}$   (update of $V(s_i)$ after three episodes)

Exploration is guided by the initial probabilities and by a heuristic guiding the choice of branch in the tree.
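A minimal Python sketch of (every-visit) Monte-Carlo value estimation in this spirit: average the returns observed after each state over several made-up episodes:

    from collections import defaultdict

    gamma = 1.0   # undiscounted here, for simplicity
    # Hypothetical episodes: lists of (state, reward received after leaving that state).
    episodes = [
        [("A", 0.0), ("B", 0.0), ("C", 1.0)],
        [("A", 0.0), ("C", 1.0)],
        [("B", 0.0), ("C", -1.0)],
    ]

    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the return that followed each visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state].append(G)             # every-visit Monte-Carlo

    V = {s: sum(g) / len(g) for s, g in returns.items()}
    print(V)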

SLIDE 33

Dynamic Programming

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Dynamic Programming backup:

$V(s_t) \leftarrow E_{\pi}\!\left[ r_{t+1} + \gamma\, V(s_{t+1}) \right]$

[Backup diagram: a full-width backup from $s_t$ over all actions and all successor states $s_{t+1}$.]

SLIDE 34

TD-Learning

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Monte-Carlo backup:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where $R_t$ is the actual return following state $s_t$ (the whole sampled trajectory from $s_t$ to the terminal state is used).

SLIDE 35

TD-Learning

Adapted from R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

TD backup:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \right]$

where $r_{t+1} + \gamma V(s_{t+1})$ is an estimate of the true return $R_t$ (only the one-step sampled transition from $s_t$ to $s_{t+1}$ is used).
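A minimal Python sketch of this tabular TD(0) update applied to a stream of made-up transitions:

    alpha, gamma = 0.1, 0.9
    V = {"A": 0.0, "B": 0.0, "C": 0.0}

    # Hypothetical observed transitions (s_t, r_{t+1}, s_{t+1}).
    transitions = [("A", 0.0, "B"), ("B", 1.0, "C"), ("A", 0.0, "B"), ("B", 1.0, "C")]

    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]   # estimate of the return minus the current value
        V[s] += alpha * td_error                  # V(s_t) <- V(s_t) + alpha * TD error

    print(V)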

SLIDE 36

Sarsa: On-Policy TD Control
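The slide only names the algorithm; as a reminder, a hedged sketch of the standard SARSA update on a Q-table. The sample quintuple is made up; in a real agent, s' and a' would come from acting with, e.g., an ε-greedy policy:

    from collections import defaultdict

    alpha, gamma = 0.1, 0.9
    Q = defaultdict(float)   # Q[(state, action)], initialised to 0

    def sarsa_update(Q, s, a, r, s_next, a_next):
        """On-policy TD control: the target uses the action a' actually taken in s'."""
        target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    # Hypothetical quintuple (s, a, r, s', a') observed while following the current policy.
    sarsa_update(Q, "s0", "right", 0.0, "s1", "up")
    print(dict(Q))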

SLIDE 37

DP – MC – TD

[Spectrum from offline to online:]
  • Offline (we have the models, no interaction with the environment): Bellman Equation, Policy Evaluation, Value Iteration, Dynamic Programming (DP), e.g. the Block World.
  • Online (too hard to model, physical interaction with the environment): Monte Carlo (MC), Temporal Difference (TD), SARSA, Q-Learning.

SLIDE 38

RL: exercise II

SLIDE 39

Off-line versus on-line search

Off-line search is preferred as:

  • It speeds up learning (much faster than doing a roll-out in real time)
  • It ensures that no damage is done to the hardware.

However, it is possible only when one has a very realistic simulator. Often, one bootstraps learning off-line and refines it on-line.

Morimoto and Doya, Robotics and Autonomous Systems , 2001

In the experiments teaching a two-legged robot to stand up, the authors conducted:

  • 750 trials in simulation
  • 170 trials on the real robot

SLIDE 40

Exploration versus exploitation

One may need to start acting in the real world before having attained a reasonable estimate of the optimal value function (optimal policy). For the actions to yield a reasonable outcome, it is best to act in regions of the state-action space already visited. But when doing this, one keeps sampling from the same region and does not explore new areas. Learning stagnates! To keep learning: balance exploitation (risk-averse) and exploration (risk-seeking) strategies.

[Illustration: "Risks for dummies"]
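One common way to balance the two (not the only one) is an ε-greedy rule: explore a random action with probability ε, otherwise exploit the current estimate. A minimal sketch with a hypothetical Q-table:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """Exploit argmax_a Q(s, a) most of the time; explore a random action with probability epsilon."""
        if random.random() < epsilon:
            return random.choice(actions)                            # exploration (risk-seeking)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))    # exploitation (risk-averse)

    Q = {("s0", "left"): 0.2, ("s0", "right"): 0.5}                  # hypothetical estimates
    print(epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1))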

SLIDE 41

Drawbacks of standard RL

Curse of dimensionality: computational costs increase dramatically with the number of states, even if the problem is discrete by nature, e.g.:

  • Play Backgammon
  • Make train schedule
  • Determine slots for TV programs

The number of states may be huge ($10^{20}$ for Backgammon) and may exceed the memory of the system.

SLIDE 42

Drawbacks of standard RL

Markov World: For most real-world problems, the state and/or actions are not discrete by nature.

  • Robotics: controlling the motion of a robot (the motion and the state of the joints are continuous).
  • Finance: the values of stocks are continuous; the actions (buy or not) may be discrete.

Discretizing is possible, but deciding on the granularity of the discretization is difficult and will impact the precision of the control.

SLIDE 43

Drawbacks of standard RL

  • Curse of dimensionality: computational costs increase dramatically with the number of states.
  • Markov world: cannot handle continuous action and state spaces.
  • Model-based vs. model-free: may need a model of the world (it can be estimated through exploration).

=> Gradient methods to handle continuous state and action spaces.

SLIDE 44

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 45

RL by function approximation

 

Parametrize the value function: ; V s 

Open parameters

       

1

Parametrize the value function such that: ; form a set of basis functions (e.g. RBF functions). These are set by the user and also called featur the . weights associ are th es e

T j j j K j j

V s K s s s       

  

. These are the ated to each fea unknown paramet ture ers.

55
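A minimal Python sketch of this parametrization with Gaussian RBF basis functions; the centers, width and weights are arbitrary placeholders chosen for the example:

    import numpy as np

    centers = np.array([0.0, 0.5, 1.0])   # hypothetical RBF centers (the features set by the user)
    sigma = 0.25                           # hypothetical RBF width

    def phi(s):
        """Feature vector phi(s): one Gaussian RBF per center."""
        return np.exp(-((s - centers) ** 2) / (2.0 * sigma ** 2))

    theta = np.array([0.1, 0.7, 0.3])      # the unknown weights, here arbitrary initial values

    def V(s, theta):
        """Linear value-function approximation V(s; theta) = theta^T phi(s)."""
        return theta @ phi(s)

    print(V(0.4, theta))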

SLIDE 46

Learning the value function

How to update the value function?

In Monte-Carlo learning, the target is the expected return $R_t$:

$\theta \leftarrow \theta + \alpha \left[ R_t - V(s_t; \theta) \right] \nabla_{\theta} V(s_t; \theta), \qquad \alpha: \text{learning rate}$

In TD learning, the target is $r_{t+1} + \gamma\, V(s_{t+1}; \theta)$:

$\theta \leftarrow \theta + \alpha \left[ r_{t+1} + \gamma\, V(s_{t+1}; \theta) - V(s_t; \theta) \right] \nabla_{\theta} V(s_t; \theta)$

Do roll-outs, measure the sets of states and rewards $\left\{(s_t, r_t)\right\}_{t=1}^{T}$, and update the value function using the above equations.

One can use other techniques, e.g. ML techniques for non-linear regression, to get a better estimate of the parameters than simple gradient descent; see Deisenroth et al., Foundations & Trends in Robotics, 2011, and Peters & Schaal, Neural Networks, 2008, for surveys.
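A minimal Python sketch of the TD update above for a linear parametrization V(s; θ) = θᵀφ(s), for which ∇_θ V = φ(s); the feature map, learning rate and sample transition are placeholders:

    import numpy as np

    alpha, gamma = 0.05, 0.9

    def phi(s):
        """Hypothetical feature map for a 1-D state."""
        return np.array([1.0, s, s ** 2])

    def td_update(theta, s, r, s_next):
        """theta <- theta + alpha * [r + gamma*V(s') - V(s)] * grad_theta V(s), with grad = phi(s)."""
        v_s, v_next = theta @ phi(s), theta @ phi(s_next)
        td_error = r + gamma * v_next - v_s
        return theta + alpha * td_error * phi(s)

    theta = np.zeros(3)
    theta = td_update(theta, s=0.2, r=1.0, s_next=0.4)   # one hypothetical transition
    print(theta)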

SLIDE 47

RL by function approximation: example

Value function: $V(s; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

Example: $s$ is the Cartesian state of the robot's gripper; the $\phi_j(s)$ are features of the state, e.g.:
  • distances to the holes
  • distances to the walls

Choosing the features is not trivial and is key to success.

SLIDE 48

RL by function approximation: example

What would be a good set of weights for the robot to learn how to sink the ball into any of the holes? What would the value function look like?

Value function: $V(s; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

Example: $s$ is the Cartesian state of the robot's gripper; the $\phi_j(s)$ are features of the state, e.g.:
  • distances to the holes
  • distances to the walls

SLIDE 49

Learning the value function

Value function: $V(s; \theta) = \theta_1\, \phi_{\mathrm{dist.walls}}(s) + \theta_2\, \phi_{\mathrm{dist.holes}}(s)$, with
$\phi_{\mathrm{dist.walls}}(s)$: 1 if the gripper hits a wall, 0 otherwise;
$\phi_{\mathrm{dist.holes}}(s)$: 1 if the ball is sunk into a hole, 0 otherwise.

Start with the initial estimate $\theta_1 = \theta_2 = 0.5$. Perform one roll-out with the greedy policy and gather the reward: -10 (hit a wall).

Update the value function by gradient descent (TD):

$\theta \leftarrow \theta + \alpha \left[ r_{t+1} + \gamma\, \theta^{T} \phi(s_{t+1}) - \theta^{T} \phi(s_t) \right] \phi(s_t)$

The update is influenced immediately by the active feature.

SLIDE 50

From value function to policy

When the actions are simply the derivative of the state ($a = \dot{s}$), e.g. the motion of a robot in 2-D space, the greedy policy can be derived by taking the gradient of the value function:

$\pi(a \mid s): \quad a_t = \dot{s}_t \sim \beta\, \nabla_{s} V(s_t), \qquad \beta: \text{scaling factor}$

[Figure: value function and the trajectories obtained by following its gradient]
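A minimal Python sketch of this idea: obtain the action as a scaled (finite-difference) gradient of a hypothetical value function over a 2-D state:

    import numpy as np

    def V(s):
        """Hypothetical smooth value function over a 2-D state, peaked at the goal [1, 1]."""
        goal = np.array([1.0, 1.0])
        return -np.sum((s - goal) ** 2)

    def policy(s, beta=0.5, eps=1e-4):
        """a = s_dot ~ beta * grad_s V(s); the gradient is estimated by central finite differences."""
        grad = np.zeros_like(s)
        for i in range(len(s)):
            d = np.zeros_like(s)
            d[i] = eps
            grad[i] = (V(s + d) - V(s - d)) / (2 * eps)
        return beta * grad

    print(policy(np.array([0.0, 0.0])))   # action pointing towards the goal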

SLIDE 51

From value function to policy

When the actions are simply the derivative of the state ($a = \dot{s}$), e.g. the motion of a robot in 2-D space, the greedy policy can be derived by taking the gradient of the value function: $\pi(a \mid s) \sim \nabla_{s} V(s)$.

This is not possible when the actions differ from the state space, e.g. an underactuated robot.

Actor-Critic:
  • Critic: update the parameters of the value function (or action-value function) using, e.g., TD learning.
  • Actor: update the policy parameters in the direction suggested by the critic, using gradient descent.

SLIDE 52

Robotics Applications of continuous RL

Teaching a two-joint, three-link robot leg to stand up. Robot state: θ0: pitch angle; θ1: hip joint angle; θ2: knee joint angle; θm: angle of the line from the center of mass to the center of the foot. Robot actions: torques actuating the two joints.

Morimoto and Doya, Robotics and Autonomous Systems , 2001

SLIDE 53

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems , 2001

GOAL: The stand-up task is accomplished when the robot stands up and stays upright for more than 2(T + 1) seconds. SUBGOALS: Reaching the final goal is a necessary but not sufficient condition of a successful stand-up, because the robot may fall down after passing through the final goal; hence the need to define subgoals.

SLIDE 54

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

Decompose the task into upper-level and lower-level sets of goals. At the upper level, perform a coarse discretization of the state-action space and use Q-learning.

When the robot achieves a sub-goal, the learner gets a reward < 0.5. The full reward is obtained when all subgoals and the main goal are achieved.

Reward at the upper level: Y is the height of the head of the robot at a sub-goal posture, and L is the total length of the robot.

SLIDE 55

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems, 2001

The lower level learns how to achieve each subgoal. The reward is the level of achievement of the subgoal. States and actions are continuous in time: $s(t)$, $u(t)$.

Model the value function: $V(s; \theta) = \theta^{T} \phi(s)$

Model the policy: $u(s) = f\!\left( \sum_{j=1}^{K} w_j\, \phi_j(s) + e(t) \right)$, with $f$: truncated sigmoid function and $e(t)$: noise for exploration.

Critic: learn $\theta$ through gradient descent on the squared TD error.
Actor: use the TD error to estimate the update of the policy parameters $w$.

SLIDE 56

Robotics Applications of continuous RL

Morimoto and Doya, Robotics and Autonomous Systems , 2001

  • 750 trials in simulation + 170 on real robot
  • Goal: to stand up

SLIDE 57

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 58

Parametrize the Q-value function

Approximate the state-action value function:

$Q(s, a; \theta) = \sum_{j=1}^{K} \theta_j\, \phi_j(s, a) = \theta^{T} \phi(s, a)$

The $\left\{\phi_j(s, a)\right\}_{j=1}^{K}$, $\phi_j: \mathbb{R}^{N} \times \mathbb{R}^{P} \rightarrow \mathbb{R}$, are a set of basis functions. These are set by the user and are also called the features. The $\theta_j$ are the weights associated to each feature; these are the unknown parameters.

SLIDE 59

Update on Bellman residual error

Recall (see the slides on discrete RL) the update step of Q-learning:

$Q(s_t, a_t; \theta) \leftarrow r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta)$

The Bellman residual error is given by:

$L = \left[ Q(s_t, a_t; \theta) - \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta) \right) \right]^{2}$

Use the Bellman residual error to determine the parameters of the Q-value function iteratively; see the next slide.
SLIDE 60

Update on Bellman residual error

Perform a roll-out over $T$ time steps using the policy $\pi(a \mid s; \theta)$ (greedy search on $\arg\max_{a} Q(s, a; \theta)$ using the current estimate of $Q(s, a; \theta)$). Collect the samples $s_{1:T}$, $a_{1:T-1}$, $r_{2:T}$.

Determine how good an initial choice of the parameters $\theta$ is by comparing the predicted reward and the actual reward along the roll-out, through the Bellman residual error:

$L = \sum_{t=1}^{T-1} \left[ Q(s_t, a_t; \theta) - \left( r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}; \theta) \right) \right]^{2}, \qquad \gamma \in [0, 1]: \text{discount factor}$

The solution is found by least squares on this objective function.

See Lagoudakis & Parr, Journal of Machine Learning Research, 2003, which offers numerical solutions to determine the optimal $\alpha$ parameter.
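A much-simplified Python sketch of fitting linear Q-parameters by least squares on roll-out samples. The residual is linearized by treating the next state-action features as fixed, in the spirit of the least-squares methods cited above; the samples, features and γ are placeholders, not the authors' setup:

    import numpy as np

    gamma = 0.9

    def phi(s, a):
        """Hypothetical features over a 1-D state and a discrete action in {0, 1}."""
        return np.array([1.0, s, s * a, float(a)])

    # Hypothetical roll-out samples (s, a, r, s_next, a_next), with a_next chosen greedily.
    samples = [(0.0, 1, 0.0, 0.2, 1), (0.2, 1, 0.0, 0.5, 0), (0.5, 0, 1.0, 0.6, 0)]

    # Linear Bellman residual: theta^T [phi(s, a) - gamma * phi(s', a')] should match the reward r.
    X = np.array([phi(s, a) - gamma * phi(s2, a2) for s, a, r, s2, a2 in samples])
    y = np.array([r for _, _, r, _, _ in samples])

    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution for the parameters
    print(theta)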

SLIDE 61

Example: learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

Goal: learn to balance and to ride a bicycle to a target position located 1 km away from the starting location.

Continuous state: a six-dimensional real-valued vector $S = (\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}, \psi)$, built from:
  • the angle of the handlebar,
  • the vertical angle of the bicycle,
  • the angle of the bicycle to the goal.

5 discrete actions: torque applied to the handlebar (discretized to {−2, 0, +2}) and displacement of the rider (discretized to {−0.02, 0, +0.02}).

Model of the world (simulated): uniform noise on each action; model of the bike's dynamics.

SLIDE 62

Example: learning to ride a bike

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

State: $S = (\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}, \psi)$.

Q-value function parametrized with 20 basis functions: $Q(s, a; w) = \sum_{i=1}^{20} w_i\, \phi_i(s, a)$.

Features: combinations of the state and its first and second derivatives ($1, \omega, \dot{\omega}, \omega^{2}, \dot{\omega}^{2}, \omega\dot{\omega}, \theta, \dot{\theta}, \theta^{2}, \dot{\theta}^{2}, \theta\dot{\theta}, \dots$), paired with a uniform distribution over the discrete values of $a$.

Iterate between using the policy to generate roll-outs (taking the greedy approach on the current estimate of $Q$) and updating $Q$ using the Bellman error.

SLIDE 63

Example: learning to ride a bike

Collect training samples using a random policy. Each episode lasts 20 steps. The learned policy is evaluated 100 times to estimate the probability of success.

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

The policy after the first iteration balances the bicycle, but fails to ride to the goal. The policy after the second iteration heads towards the goal, but fails to balance. All policies thereafter balance and ride the bicycle to the goal. Crashes still happen because of the noise in the model.

SLIDE 64

Example: learning to ride a bike

Successful policies are found after a few thousand training episodes. With 5000 training episodes (60000 samples), the probability of success is about 95% and the expected number of balancing steps is about 70000 steps.

Lagoudakis & Parr, Journal of Machine Learning Research, 2003

SLIDE 65

RL in continuous state and action spaces

States $s_t \in \mathbb{R}^{N}$ and actions $a_t \in \mathbb{R}^{P}$, $t = 1 \dots T$, are continuous. One can no longer sweep through all states and actions to determine the optimal policy. Instead, one can either:

1) use function approximation to estimate the value function V(s),
2) use function approximation to estimate the state-action value function Q(s, a), or
3) optimize a parameterized policy $\pi(a \mid s)$ (policy search).

SLIDE 66

Policy Gradients

An alternative is to parametrize a stochastic policy: $\pi(a \mid s)$ is approximated by drawing from the distribution $p(a \mid s; \theta)$, where $\theta$ are the parameters of the policy.

The actor, in state $s_t$, chooses an action $a_t$ following the stochastic policy $\pi(a_t \mid s_t)$, i.e. by drawing from the distribution $p(a_t \mid s_t; \theta)$.
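A minimal Python sketch of drawing actions from such a parameterized stochastic policy, here a Gaussian whose mean is linear in hypothetical state features:

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(s):
        """Hypothetical state features."""
        return np.array([1.0, s])

    theta_mean = np.array([0.2, -0.5])   # policy parameters (mean of the action distribution)
    sigma = 0.1                          # fixed exploration noise

    def sample_action(s):
        """Draw a_t ~ N(theta^T phi(s), sigma^2)."""
        return rng.normal(loc=theta_mean @ phi(s), scale=sigma)

    print(sample_action(0.8))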

SLIDE 67

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Starting from a deterministic estimate of the policy,

$a = \pi(s; \theta) = \theta^{T} \phi(s) = \sum_{j=1}^{K} \theta_j\, \phi_j(s)$

where the $\phi_j(s)$ are again a set of known basis functions and the $\theta_j$ the weights for each basis function.

Adding some noise $\epsilon_t \sim \mathcal{N}(0, \Sigma)$ for exploration, and making the policy explicitly time-dependent, $a_t = \pi(s_t, t; \theta) = (\theta + \epsilon_t)^{T} \phi(s_t, t)$, leads to a stochastic policy:

$\pi(a \mid s_t, t; \theta) \sim \mathcal{N}\!\left( a \;\middle|\; \theta^{T} \phi(s_t, t),\; \phi(s_t, t)^{T} \Sigma\, \phi(s_t, t) \right)$

SLIDE 68

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

At each roll-out (episode), the agent gathers tuples of rewards and associated states and actions: $\left\{(s_t, a_t, r_t)\right\}_{t=1}^{T}$.

Compute an unbiased estimate of the Q-function:

$\hat{Q}^{\pi}(s_t, a_t, t) = \sum_{\tilde{t}=t}^{T} r_{\tilde{t}}\!\left(s_{\tilde{t}}, a_{\tilde{t}}\right)$

Update the parameters $\theta$ at each iteration by weighting the exploration $\epsilon_t$ with the returns,

$\theta \leftarrow \theta + \frac{E\!\left[\sum_{t=1}^{T} W(s_t, t)\, \epsilon_t\, \hat{Q}^{\pi}(s_t, a_t, t)\right]}{E\!\left[\sum_{t=1}^{T} W(s_t, t)\, \hat{Q}^{\pi}(s_t, a_t, t)\right]}, \qquad W(s_t, t) = \frac{\phi(s_t, t)\, \phi(s_t, t)^{T}}{\phi(s_t, t)^{T}\, \Sigma\, \phi(s_t, t)},$

until convergence, i.e. until the change in $\theta$ is smaller than a threshold.

SLIDE 69

Policy learning by Weighted Exploration with the Returns (PoWER)

Kober and Peters, Machine Learning, 2011

Teaching a robot to play the ball-in-a-cup task.
State: joint angles and velocities of the robot + Cartesian coordinates of the ball.
Action: joint-space accelerations.
Reward: distance of the ball to the rim of the cup.

Policy: $\pi(a \mid x, \dot{x}, t) \sim \mathcal{N}\!\left( \theta^{T} \phi(x, \dot{x}, t),\; \Sigma \right)$, with 31 basis functions per DOF.
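A heavily simplified, episodic Python sketch of the general idea of weighting exploration by the returns: perturb the policy parameters per roll-out, evaluate the return, and move the parameters towards a return-weighted average of the perturbations. This is only a toy illustration of the spirit of PoWER on a made-up one-step objective, not the authors' algorithm:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = np.zeros(2)                   # policy parameters
    sigma = 0.5                           # exploration noise
    theta_opt = np.array([1.0, -2.0])     # hypothetical (unknown to the learner) optimum

    def episode_return(theta):
        """Toy return: larger when theta is closer to the optimum."""
        return -np.sum((theta - theta_opt) ** 2)

    for iteration in range(200):
        eps = [rng.normal(0.0, sigma, size=theta.shape) for _ in range(10)]   # exploration noise
        R = np.array([episode_return(theta + e) for e in eps])                # returns of the roll-outs
        w = np.exp(R - R.max())                                               # return-based weights
        theta = theta + sum(wi * ei for wi, ei in zip(w, eps)) / w.sum()      # weighted average of the noise

    print(theta)   # should end up close to theta_opt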

SLIDE 70

Extensions of RL framework

1) A major difficulty in RL is to determine the reward. Inverse Reinforcement Learning (closely related to Inverse Optimal Control for robotics problems) was proposed as a framework to estimate what the reward could be; see:

  • Ng, Andrew Y., and Stuart J. Russell. "Algorithms for inverse reinforcement learning." ICML, 2000.
  • Ziebart, Brian D., et al. "Maximum Entropy Inverse Reinforcement Learning." AAAI, 2008.
  • Abbeel, Pieter, and Andrew Y. Ng. "Inverse reinforcement learning." Encyclopedia of Machine Learning. Springer US, 2011. 554-558.

2) Reinforcement learning always assumes that good observations of the system (positive reward) are available. This is often impractical (no expert to teach the robot). Approaches exist to learn from bad examples only, i.e. from failure ("Donut"); see:

  • Grollman, D. H., and Billard, A. "Donut as I do: Learning from failed demonstrations." Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011; and "Robot learning from failed demonstrations." International Journal of Social Robotics 4.4 (2012): 331-342.
  • Rai, A., de Chambrier, G., and Billard, A. (2013). Learning from Failed Demonstrations in Unreliable Systems. IEEE-RAS International Conference on Humanoid Robots.

SLIDE 71

Summary

  • RL was first coined for discrete state and action spaces.
  • An RL problem is entirely determined by its states, actions, rewards, state-transition probabilities, probability of reward, and state-action transitions (policy). It assumes that all probabilities are first-order Markov.
  • When the world is known, an optimal solution for the policy can be found using Dynamic Programming (DP).
  • Otherwise, iterative techniques are used to approximate the optimal solution (Monte-Carlo, TD, SARSA). This requires generating roll-outs to span the state-action space. When the policy is used right away during training, one must balance exploration with exploitation.
  • Continuous RL problems extend the discrete RL framework. They reuse the notions of value function, Q-value function and policy, but treat these as continuous functions and approximate their values, e.g. by gradient descent.