Spring 2018 CIS 693, EEC 693, EEC 793:
Autonomous Intelligent Robotics
Instructor: Shiqi Zhang
http://eecs.csuohio.edu/~szhang/teaching/18spring/
Reinforcement Learning
Adapted from Peter Bodík

Previous Lectures
• Supervised learning: classification, regression
• Unsupervised learning: clustering, dimensionality reduction
• generalization of supervised learning
• learn from interaction with the environment to achieve a goal
[Diagram: agent-environment loop: the agent sends an action; the environment returns a reward and a new state]
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
[Grid-world figure: the agent starts at the START cell]
• actions: UP, DOWN, LEFT, RIGHT
• UP moves up 80% of the time, moves LEFT 10% of the time, moves RIGHT 10% of the time
• is a reward of “10” good or bad? rewards could be delayed
• not just blind search; try to be smart about it
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
[Grid-world figure: the agent starts at the START cell]
• actions: UP, DOWN, LEFT, RIGHT
• UP moves up 80% of the time, moves LEFT 10% of the time, moves RIGHT 10% of the time
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• is a fixed sequence of actions a solution? not in this case (actions are stochastic)
• solution = policy: a mapping from each state to an action
• transition model: P( [1,2] | [1,1], UP ) = 0.8; Markov assumption (the next state depends only on the current state and action)
• reward function: r( [4,3] ) = +1
• policy notation: π(s) or π(s,a)
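To make these pieces concrete, here is a minimal Python sketch, not from the slides, of how this grid world could be encoded; the wall position at (2,2), the cell coordinates, and the helper names (reward, move, transition_probs) are illustrative assumptions.

# Minimal sketch of the 4x3 grid world as an MDP (illustrative assumptions).
COLS, ROWS = 4, 3
WALL = {(2, 2)}                    # assumed wall cell
TERMINAL = {(4, 3), (4, 2)}        # +1 and -1 cells end the episode
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def reward(s):
    # r(s): +1 at [4,3], -1 at [4,2], -0.04 for every other step
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s, -0.04)

def move(s, a):
    # deterministic move; bumping into the wall or an edge leaves the state unchanged
    nx, ny = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    blocked = (nx, ny) in WALL or not (1 <= nx <= COLS and 1 <= ny <= ROWS)
    return s if blocked else (nx, ny)

def transition_probs(s, a):
    # stochastic action model: 80% intended direction, 10% to each side
    side = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
            "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
    probs = {}
    for direction, p in [(a, 0.8), (side[a][0], 0.1), (side[a][1], 0.1)]:
        s2 = move(s, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs  # e.g. transition_probs((1, 1), "UP")[(1, 2)] == 0.8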
• transition model and rewards usually not available
• how to change the policy based on experience
• how to explore the environment
• finite horizon: “game over” after N steps; the optimal policy depends on N, harder to analyze
• additive return: V(s0, s1, …) = r(s0) + r(s1) + r(s2) + … ; infinite value for continuing tasks
• discounted return: V(s0, s1, …) = r(s0) + γ·r(s1) + γ²·r(s2) + … ; value bounded if rewards are bounded
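As a quick worked example of the discounted return (the γ = 0.9 and the reward sequence below are made up):

# Discounted return V = r0 + γ*r1 + γ²*r2 + ...  (gamma and rewards are illustrative)
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]   # e.g. three -0.04 steps, then reaching the +1 cell
V = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(V)  # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206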
• state value function Vπ(s): expected return when starting in s and following π
• action value function Qπ(s,a): expected return when starting in s, performing a, and following π
• can be estimated from experience; pick the best action using Qπ(s,a)
• Vπ defines a partial ordering on policies
• all optimal policies share the same optimal value function
• Bellman optimality equations: a system of n non-linear equations; solve for V*(s); then it is easy to extract the optimal policy
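For reference, one common way to write the Bellman optimality equation these bullets refer to, using the r(s) reward convention of this lecture (LaTeX notation):

V^{*}(s) \;=\; r(s) \;+\; \gamma \,\max_{a} \sum_{s'} P(s' \mid s, a)\, V^{*}(s')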
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• use value functions to structure the search for good policies
• needs a perfect model of the environment
• policy evaluation: compute Vπ from π
• policy improvement: improve π based on Vπ
• policy iteration: start with an arbitrary policy; repeat evaluation/improvement until convergence
• the Bellman equations define a system of n equations
• could solve them directly, but we will use an iterative version: start with an arbitrary value function V0 and iterate until Vk converges
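A compact sketch, under assumed data structures (P[s][a] as a dict of successor probabilities, R[s] as the reward, pi[s] as the policy's action; terminal states, if any, are assumed to self-loop with reward 0), of what iterative policy evaluation could look like:

# Iterative policy evaluation sketch (illustrative encoding, not from the slides).
def policy_evaluation(states, P, R, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                  # arbitrary initial value function V0
    while True:
        delta = 0.0
        for s in states:
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                          # sweep all states, update in place
        if delta < theta:                         # stop once Vk has (nearly) converged
            return V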
• policy improvement: π′ is either strictly better than π, or π′ is optimal (if π = π′)
• policy iteration has two nested iterations; too slow
• we don't need evaluation to converge to Vπk, just to move towards it
• value iteration: use the Bellman optimality equation as an update; converges to V*
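And a matching value-iteration sketch that applies the Bellman optimality equation as an update and then reads off a greedy policy (same assumed encoding as above):

# Value iteration sketch: Bellman optimality equation used as an update rule.
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[s][a].items()) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extracting the (greedy) optimal policy from V* is then easy:
    pi = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in states}
    return V, pi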
• robot in a room: state space, action space, transition model
• robot in a room? backgammon? helicopter?
• bootstrapping: updates estimates on the basis of other estimates
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• needs just experience, or simulated experience
• defined only for episodic tasks
• policy evaluation, policy improvement
• Vπ(s) = expected return starting from s and following π
• estimate as the average of observed returns in state s
• first-visit MC: average the returns following the first visit to state s
[Figure: four sample episodes passing through state s, with returns R1(s) = +2, R2(s) = +1, R3(s) = -5, R4(s) = +4]
Vπ(s) ≈ (2 + 1 - 5 + 4)/4 = 0.5
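A first-visit MC sketch along these lines, assuming episodes are given as lists of (state, reward) pairs and returns are undiscounted for simplicity:

from collections import defaultdict

# First-visit Monte Carlo estimate of V(s): average the return that follows
# the first visit to s in each episode (undiscounted, illustrative).
def first_visit_mc(episodes):
    returns = defaultdict(list)
    for episode in episodes:                 # episode = [(s0, r0), (s1, r1), ...]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                # only the FIRST visit to s counts
                seen.add(s)
                G = sum(r for _, r in episode[t:])   # return following that visit
                returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}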
• needs an exact model of the environment
• update after each episode
• the greedy policy won't explore all actions
• we don't know anything about the environment at the beginning; we need to try all actions to find the optimal one
• use soft policies instead: π(s,a) > 0 for all (s,a)
• ε-greedy policy: with probability 1-ε perform the optimal/greedy action, with probability ε perform a random action
• this keeps exploring the environment; slowly move towards the greedy policy by letting ε -> 0
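A minimal ε-greedy selection sketch (the Q-table keyed by (state, action) pairs is an assumption):

import random

# epsilon-greedy: random action with probability eps, greedy action otherwise.
def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit (greedy)

Decaying eps towards 0 over training recovers the "slowly move it towards the greedy policy" idea above.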
• example episode: s0: A♣, A♦, 6♠, A♥, 2♠;  a0: discard 6♠, 2♠;  s1: A♣, A♦, A♥, A♠, 9♠ (and the dealer takes 4 cards);  return: +1 (probably)
• DP approach: list all states and actions, compute P(s,a,s'), e.g. P( [A♣,A♦,6♠,A♥,2♠], [6♠,2♠], [A♠,9♠,4] ) = 0.00192
• MC approach: all you need are sample episodes; let MC play against a random policy, itself, or another algorithm
• averaging of sample returns; only for episodic tasks
• learns from sample episodes or simulated experience
• doesn't need a full sweep over all states
• less harmed by violations of the Markov property
• exploration handled with soft policies
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• like MC: learns directly from experience (doesn't need a model)
• like DP: bootstraps
• works for continuing tasks; usually faster than MC
• MC has to wait until the end of the episode to update
• TD updates after every step, based on the successor state
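The per-step TD(0) update described above, sketched with an assumed learning rate α and a value table V:

# TD(0): after observing (s, r, s'), move V(s) towards the one-step target.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V[s_next]      # bootstrapped target from the successor state
    V[s] += alpha * (target - V[s])     # update after every step, not per episode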
• example: two states A and B; eight observed episodes: A,0, B,0; then B,1 six times; then B,0
• batch MC converges to the values that minimize the error on the training data
• batch TD(0) converges to the estimate of the underlying Markov process
[Diagram: A goes to B with r = 0 (100%); from B the episode ends with r = 1 (75%) or r = 0 (25%)]
• Sarsa (on-policy TD control): start with a random policy; update Q and π after each step; again, needs ε-soft policies
[Diagram: trajectory of states, actions, and rewards: …, at, rt+1, st+1, at+1, rt+2, st+2, at+2, …]
• start with a random policy and iteratively improve; converges to the optimal policy
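Assuming this slide is describing Sarsa (the on-policy TD control method named in the summary), the per-step update is roughly:

# Sarsa update: bootstraps from the action a' actually chosen by the (eps-soft) policy in s'.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])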
• Q-learning (off-policy TD control): use any policy to estimate Q
• Q directly approximates Q* (the Bellman optimality equation)
• independent of the policy being followed
• only requirement: keep updating each (s,a) pair
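For comparison, a Q-learning update sketch: it bootstraps from the best next action, which is why it approximates Q* independently of the policy that generated the data:

# Q-learning update: off-policy, uses the max over next actions.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])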
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• would need to explore to estimate the effects of actions; that would take too long in this case
• model: input: workload and system configuration; output: performance under this workload; also model transients: how long it takes to move data
• can efficiently search for the best actions; move the smallest amount of data needed to handle the workload
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• pole balancing: move the cart left/right to keep the pole balanced
• state: position and velocity of the cart; angle and angular velocity of the pole
• is this state Markov? strictly we would need more information: noise in the sensors, temperature, bending of the pole
• in practice: coarse discretization of the 4 state variables (e.g. left, center, right); totally non-Markov, but it still works
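One way such a coarse discretization could look in code; the bin boundaries below are invented for illustration:

# Coarse discretization of the 4 cart-pole state variables (bin edges are made up).
def discretize(x, x_dot, theta, theta_dot):
    def bin3(v, lo, hi):                  # map a value to 0/1/2 ("left"/"center"/"right")
        return 0 if v < lo else (2 if v > hi else 1)
    return (bin3(x, -1.0, 1.0),           # cart position
            bin3(x_dot, -0.5, 0.5),       # cart velocity
            bin3(theta, -0.05, 0.05),     # pole angle (radians)
            bin3(theta_dot, -0.5, 0.5))   # pole angular velocity

The returned tuple can serve directly as a discrete state key in a Q-table, even though the true dynamics are no longer Markov in that key.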
• function approximation: linear regression, decision tree, neural net, …
• e.g. linear regression
• better generalization: fewer parameters, and updates affect “similar” states as well
• treat each observed state and its return as one data point for regression
• want a method that can learn on-line (update after each step)
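A sketch of the linear value-function approximator with an on-line, per-step update; the feature vector phi(s), step size, and function names are assumptions:

# Linear value-function approximation: V(s) ≈ w · phi(s), updated on-line.
def predict(w, phi_s):
    return sum(wi * xi for wi, xi in zip(w, phi_s))

def online_update(w, phi_s, target, alpha=0.01):
    # One SGD step towards the target return (e.g. an MC return or a TD target).
    error = target - predict(w, phi_s)
    return [wi + alpha * error * xi for wi, xi in zip(w, phi_s)]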
• adaptive discretization: learn the best discretization during training
• splitting: start with a single state; split a state when different parts of it have different values
• merging: start with many states; merge states with similar values
• episodic task, not discounted: +1 when out, 0 for each step
• GOOD: +1 for winning, -1 for losing
• BAD: +0.25 for taking an opponent's piece; this can give high reward even when you lose
• rewards indicate what we want to accomplish, NOT how we want to accomplish it
• the positive reward is often very “far away”
• add rewards for achieving subgoals (domain knowledge); also: adjust the initial policy or initial value function
• use RL when you need to make decisions in an uncertain environment and actions have delayed effects
• dynamic programming: needs a complete model
• Monte Carlo methods
• temporal-difference learning (Sarsa, Q-learning)
• designing features, state representation, and rewards