Spring 2018 CIS 693, EEC 693, EEC 793:
Autonomous Intelligent Robotics
Instructor: Shiqi Zhang
http://eecs.csuohio.edu/~szhang/teaching/18spring/
Reinforcement Learning
Adapted from Peter Bodík

Previous Lectures
• Supervised learning: classification, regression
• Unsupervised learning: clustering, dimensionality reduction
• generalization of supervised learning
• learn from interaction with the environment to achieve a goal
[Diagram: agent-environment loop: the agent sends an action; the environment returns a reward and a new state]
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
[Grid-world figure: the agent starts at the START cell]
• actions: UP, DOWN, LEFT, RIGHT
• UP moves up 80% of the time, moves LEFT 10% of the time, moves RIGHT 10% of the time
• is a reward of “10” good or bad? rewards could be delayed
• not just blind search; try to be smart about it
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
[Grid-world figure: the agent starts at the START cell]
• actions: UP, DOWN, LEFT, RIGHT
• UP moves up 80% of the time, moves LEFT 10% of the time, moves RIGHT 10% of the time
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• is a fixed sequence of actions a solution? not in this case (actions are stochastic)
• solution = policy: a mapping from each state to an action
• transition model: P( [1,2] | [1,1], UP ) = 0.8; Markov assumption (the next state depends only on the current state and action)
• reward function: r( [4,3] ) = +1
• policy notation: π(s) or π(s,a)
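To make these pieces concrete, here is a minimal Python sketch, not from the slides, of how this grid world could be encoded; the wall position at (2,2), the cell coordinates, and the helper names (reward, move, transition_probs) are illustrative assumptions.

# Minimal sketch of the 4x3 grid world as an MDP (illustrative assumptions).
COLS, ROWS = 4, 3
WALL = {(2, 2)}                    # assumed wall cell
TERMINAL = {(4, 3), (4, 2)}        # +1 and -1 cells end the episode
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

def reward(s):
    # r(s): +1 at [4,3], -1 at [4,2], -0.04 for every other step
    return {(4, 3): 1.0, (4, 2): -1.0}.get(s, -0.04)

def move(s, a):
    # deterministic move; bumping into the wall or an edge leaves the state unchanged
    nx, ny = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    blocked = (nx, ny) in WALL or not (1 <= nx <= COLS and 1 <= ny <= ROWS)
    return s if blocked else (nx, ny)

def transition_probs(s, a):
    # stochastic action model: 80% intended direction, 10% to each side
    side = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
            "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
    probs = {}
    for direction, p in [(a, 0.8), (side[a][0], 0.1), (side[a][1], 0.1)]:
        s2 = move(s, direction)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs  # e.g. transition_probs((1, 1), "UP")[(1, 2)] == 0.8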
• transition model and rewards usually not available
• how to change the policy based on experience
• how to explore the environment
• finite horizon: “game over” after N steps; the optimal policy depends on N, harder to analyze
• additive return: V(s0, s1, …) = r(s0) + r(s1) + r(s2) + … ; infinite value for continuing tasks
• discounted return: V(s0, s1, …) = r(s0) + γ·r(s1) + γ²·r(s2) + … ; value bounded if rewards are bounded
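As a quick worked example of the discounted return (the γ = 0.9 and the reward sequence below are made up):

# Discounted return V = r0 + γ*r1 + γ²*r2 + ...  (gamma and rewards are illustrative)
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]   # e.g. three -0.04 steps, then reaching the +1 cell
V = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(V)  # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206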
• state value function Vπ(s): expected return when starting in s and following π
• action value function Qπ(s,a): expected return when starting in s, performing a, and following π
• can be estimated from experience; pick the best action using Qπ(s,a)
• Vπ defines a partial ordering on policies
• all optimal policies share the same optimal value function
• Bellman optimality equations: a system of n non-linear equations; solve for V*(s); then it is easy to extract the optimal policy
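For reference, one common way to write the Bellman optimality equation these bullets refer to, using the r(s) reward convention of this lecture (LaTeX notation):

V^{*}(s) \;=\; r(s) \;+\; \gamma \,\max_{a} \sum_{s'} P(s' \mid s, a)\, V^{*}(s')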
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• use value functions to structure the search for good policies
• needs a perfect model of the environment
• policy evaluation: compute Vπ from π
• policy improvement: improve π based on Vπ
• policy iteration: start with an arbitrary policy; repeat evaluation/improvement until convergence
• the Bellman equations define a system of n equations
• could solve them directly, but we will use an iterative version: start with an arbitrary value function V0 and iterate until Vk converges
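A compact sketch, under assumed data structures (P[s][a] as a dict of successor probabilities, R[s] as the reward, pi[s] as the policy's action; terminal states, if any, are assumed to self-loop with reward 0), of what iterative policy evaluation could look like:

# Iterative policy evaluation sketch (illustrative encoding, not from the slides).
def policy_evaluation(states, P, R, pi, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                  # arbitrary initial value function V0
    while True:
        delta = 0.0
        for s in states:
            v_new = R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                          # sweep all states, update in place
        if delta < theta:                         # stop once Vk has (nearly) converged
            return V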
• policy improvement: π′ is either strictly better than π, or π′ is optimal (if π = π′)
• policy iteration has two nested iterations; too slow
• we don't need evaluation to converge to Vπk, just to move towards it
• value iteration: use the Bellman optimality equation as an update; converges to V*
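And a matching value-iteration sketch that applies the Bellman optimality equation as an update and then reads off a greedy policy (same assumed encoding as above):

# Value iteration sketch: Bellman optimality equation used as an update rule.
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[s][a].items()) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # extracting the (greedy) optimal policy from V* is then easy:
    pi = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[s][a].items()))
          for s in states}
    return V, pi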
• robot in a room: state space, action space, transition model
• robot in a room? backgammon? helicopter?
• bootstrapping: updates estimates on the basis of other estimates
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• needs just experience, or simulated experience
• defined only for episodic tasks
• policy evaluation, policy improvement
• Vπ(s) = expected return starting from s and following π
• estimate as the average of observed returns in state s
• first-visit MC: average the returns following the first visit to state s
[Figure: four sample episodes passing through state s, with returns R1(s) = +2, R2(s) = +1, R3(s) = -5, R4(s) = +4]
Vπ(s) ≈ (2 + 1 - 5 + 4)/4 = 0.5
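A first-visit MC sketch along these lines, assuming episodes are given as lists of (state, reward) pairs and returns are undiscounted for simplicity:

from collections import defaultdict

# First-visit Monte Carlo estimate of V(s): average the return that follows
# the first visit to s in each episode (undiscounted, illustrative).
def first_visit_mc(episodes):
    returns = defaultdict(list)
    for episode in episodes:                 # episode = [(s0, r0), (s1, r1), ...]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                # only the FIRST visit to s counts
                seen.add(s)
                G = sum(r for _, r in episode[t:])   # return following that visit
                returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}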
• needs an exact model of the environment
• update after each episode
• the greedy policy won't explore all actions
• we don't know anything about the environment at the beginning; we need to try all actions to find the optimal one
• use soft policies instead: π(s,a) > 0 for all (s,a)
• ε-greedy policy: with probability 1-ε perform the optimal/greedy action, with probability ε perform a random action
• this keeps exploring the environment; slowly move towards the greedy policy by letting ε -> 0
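A minimal ε-greedy selection sketch (the Q-table keyed by (state, action) pairs is an assumption):

import random

# epsilon-greedy: random action with probability eps, greedy action otherwise.
def epsilon_greedy(Q, state, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit (greedy)

Decaying eps towards 0 over training recovers the "slowly move it towards the greedy policy" idea above.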
• example episode: s0: A♣, A♦, 6♠, A♥, 2♠;  a0: discard 6♠, 2♠;  s1: A♣, A♦, A♥, A♠, 9♠ (and the dealer takes 4 cards);  return: +1 (probably)
• DP approach: list all states and actions, compute P(s,a,s'), e.g. P( [A♣,A♦,6♠,A♥,2♠], [6♠,2♠], [A♠,9♠,4] ) = 0.00192
• MC approach: all you need are sample episodes; let MC play against a random policy, itself, or another algorithm
• averaging of sample returns; only for episodic tasks
• learns from sample episodes or simulated experience
• doesn't need a full sweep over all states
• less harmed by violations of the Markov property
• exploration handled with soft policies
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• like MC: learns directly from experience (doesn't need a model)
• like DP: bootstraps
• works for continuing tasks; usually faster than MC
• MC has to wait until the end of the episode to update
• TD updates after every step, based on the successor state
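The per-step TD(0) update described above, sketched with an assumed learning rate α and a value table V:

# TD(0): after observing (s, r, s'), move V(s) towards the one-step target.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V[s_next]      # bootstrapped target from the successor state
    V[s] += alpha * (target - V[s])     # update after every step, not per episode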
• example: two states A and B; eight observed episodes: A,0, B,0; then B,1 six times; then B,0
• batch MC converges to the values that minimize the error on the training data
• batch TD(0) converges to the estimate of the underlying Markov process
[Diagram: A goes to B with r = 0 (100%); from B the episode ends with r = 1 (75%) or r = 0 (25%)]
• Sarsa (on-policy TD control): start with a random policy; update Q and π after each step; again, needs ε-soft policies
[Diagram: trajectory of states, actions, and rewards: …, at, rt+1, st+1, at+1, rt+2, st+2, at+2, …]
• start with a random policy and iteratively improve; converges to the optimal policy
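Assuming this slide is describing Sarsa (the on-policy TD control method named in the summary), the per-step update is roughly:

# Sarsa update: bootstraps from the action a' actually chosen by the (eps-soft) policy in s'.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])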
• Q-learning (off-policy TD control): use any policy to estimate Q
• Q directly approximates Q* (the Bellman optimality equation)
• independent of the policy being followed
• only requirement: keep updating each (s,a) pair
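For comparison, a Q-learning update sketch: it bootstraps from the best next action, which is why it approximates Q* independently of the policy that generated the data:

# Q-learning update: off-policy, uses the max over next actions.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])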
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• would need to explore to estimate the effects of actions; that would take too long in this case
• model: input: workload and system configuration; output: performance under this workload; also model transients: how long it takes to move data
• can efficiently search for the best actions; move the smallest amount of data needed to handle the workload
• solving an MDP using Dynamic Programming
• Monte Carlo methods
• Temporal-Difference learning
• state representation
• function approximation, rewards
• pole balancing: move the cart left/right to keep the pole balanced
• state: position and velocity of the cart; angle and angular velocity of the pole
• is this state Markov? strictly we would need more information: noise in the sensors, temperature, bending of the pole
• in practice: coarse discretization of the 4 state variables (e.g. left, center, right); totally non-Markov, but it still works
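One way such a coarse discretization could look in code; the bin boundaries below are invented for illustration:

# Coarse discretization of the 4 cart-pole state variables (bin edges are made up).
def discretize(x, x_dot, theta, theta_dot):
    def bin3(v, lo, hi):                  # map a value to 0/1/2 ("left"/"center"/"right")
        return 0 if v < lo else (2 if v > hi else 1)
    return (bin3(x, -1.0, 1.0),           # cart position
            bin3(x_dot, -0.5, 0.5),       # cart velocity
            bin3(theta, -0.05, 0.05),     # pole angle (radians)
            bin3(theta_dot, -0.5, 0.5))   # pole angular velocity

The returned tuple can serve directly as a discrete state key in a Q-table, even though the true dynamics are no longer Markov in that key.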
• function approximation: linear regression, decision tree, neural net, …
• e.g. linear regression
• better generalization: fewer parameters, and updates affect “similar” states as well
• treat each observed state and its return as one data point for regression
• want a method that can learn on-line (update after each step)
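A sketch of the linear value-function approximator with an on-line, per-step update; the feature vector phi(s), step size, and function names are assumptions:

# Linear value-function approximation: V(s) ≈ w · phi(s), updated on-line.
def predict(w, phi_s):
    return sum(wi * xi for wi, xi in zip(w, phi_s))

def online_update(w, phi_s, target, alpha=0.01):
    # One SGD step towards the target return (e.g. an MC return or a TD target).
    error = target - predict(w, phi_s)
    return [wi + alpha * error * xi for wi, xi in zip(w, phi_s)]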
• adaptive discretization: learn the best discretization during training
• splitting: start with a single state; split a state when different parts of it have different values
• merging: start with many states; merge states with similar values
• episodic task, not discounted: +1 when out, 0 for each step
• GOOD: +1 for winning, -1 for losing
• BAD: +0.25 for taking an opponent's piece; this can give high reward even when you lose
• rewards indicate what we want to accomplish, NOT how we want to accomplish it
• the positive reward is often very “far away”
• add rewards for achieving subgoals (domain knowledge); also: adjust the initial policy or initial value function
• use RL when you need to make decisions in an uncertain environment and actions have delayed effects
• dynamic programming: needs a complete model
• Monte Carlo methods
• temporal-difference learning (Sarsa, Q-learning)
• designing features, state representation, and rewards