Reinforcement Learning
CE417: Introduction to Artificial Intelligence
Sharif University of Technology, Spring 2019
Soleymani

Slides have been adopted from Klein and Abbeel, CS188, UC Berkeley.
Recap: MDPs

Markov decision processes:
} States S
} Actions A
} Transitions P(s'|s,a) (or T(s,a,s'))
} Rewards R(s,a,s') (and discount γ)
} Start state s0

Quantities:
} Policy = map of states to actions
} Utility = sum of discounted rewards
} Values = expected future utility from a state (max node)
} Q-Values = expected future utility from a q-state (chance node)
Reinforcement Learning

Still assume a Markov decision process (MDP):
} A set of states s ∈ S
} A set of actions (per state) A
} A model T(s,a,s')
} A reward function R(s,a,s')

New twist: we don't know T or R
} I.e. we don't know which states are good or what the actions do
} Must actually try actions and states out to learn
Basic idea:
} Receive feedback in the form of rewards
} Agent's utility is defined by the reward function
} Must (learn to) act so as to maximize expected rewards
} All learning is based on observed samples of outcomes!

[Diagram: the agent chooses actions a; the environment returns a state s and a reward r]
What just happened?

} That was learning, not planning
  } Specifically, reinforcement learning
  } There was an MDP, but you couldn't solve it with just computation
  } You needed to actually act to figure it out

} Important ideas in reinforcement learning that came up:
  } Exploration: you have to try unknown actions to get information
  } Exploitation: eventually, you have to use what you know
  } Regret: even if you learn intelligently, you make mistakes
  } Sampling: because of chance, you have to try things repeatedly
  } Difficulty: learning can be much harder than solving a known MDP
Model-Based Learning

} Model-Based Idea:
  } Learn an approximate model based on experiences
  } Solve for values as if the learned model were correct

} Step 1: Learn empirical MDP model
  } Count outcomes s' for each s, a
  } Normalize to give an estimate of the transition model T(s,a,s')
  } Discover each reward R(s,a,s') when we experience the transition (s,a,s')

} Step 2: Solve the learned MDP
  } For example, use value iteration, as before
Example: Model-Based Learning

Learned model:
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25
…
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
…
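A minimal sketch of Step 1 in Python (the episode data and all variable names here are illustrative assumptions, not from the slides): count the observed outcomes for each (s, a), then normalize the counts into an estimated transition model.

```python
from collections import Counter, defaultdict

# Observed transitions (s, a, s', r), e.g. gathered by following some policy.
# These particular transitions are made up to mirror the example above.
experience = [
    ("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10),
    ("B", "east", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10),
]

counts = defaultdict(Counter)   # counts[(s, a)][s'] = times s' was observed
R_hat = {}                      # R_hat[(s, a, s')] = observed reward

for s, a, s2, r in experience:
    counts[(s, a)][s2] += 1
    R_hat[(s, a, s2)] = r

# Normalize counts into the estimated transition model T_hat(s, a)(s').
T_hat = {(s, a): {s2: n / sum(c.values()) for s2, n in c.items()}
         for (s, a), c in counts.items()}

print(T_hat[("C", "east")])   # {'D': 0.5, 'A': 0.5} for the data above
```

Step 2 then runs value iteration on (T_hat, R_hat) exactly as in the MDP lectures, treating the learned model as if it were correct.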
Example: Expected Age

Goal: compute the expected age of a group of students, E[A]

Known P(A):
} E[A] = Σ_a P(a) · a

Without P(A), collect samples [a_1, a_2, …, a_N]:

} Unknown P(A): "Model Based"
  } Estimate P̂(a) = num(a)/N from the samples, then E[A] ≈ Σ_a P̂(a) · a
  } Why does this work? Because eventually you learn the right model.

} Unknown P(A): "Model Free"
  } Average the samples directly: E[A] ≈ (1/N) Σ_i a_i
  } Why does this work? Because samples appear with the right frequencies.
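The same contrast in code, as a minimal sketch (the sample list is made up): both estimators see identical data; the difference is whether a model P̂ is built along the way.

```python
from collections import Counter

samples = [20, 22, 20, 24, 20, 26]   # made-up draws a_i from an unknown P(A)
N = len(samples)

# Model-free: average the samples directly.
E_free = sum(samples) / N

# Model-based: estimate P_hat(a) = num(a)/N first, then take the expectation.
P_hat = {a: n / N for a, n in Counter(samples).items()}
E_based = sum(p * a for a, p in P_hat.items())

assert abs(E_free - E_based) < 1e-9   # identical up to floating-point error
```

On a fixed sample set the two numbers coincide; the model-based route pays off when you want to reuse P̂ to answer other questions about the distribution.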
Passive Reinforcement Learning

Simplified task: policy evaluation
} Input: a fixed policy π(s)
} You don't know the transitions T(s,a,s')
} You don't know the rewards R(s,a,s')
} Goal: learn the state values

In this case:
} Learner is "along for the ride"
} No choice about what actions to take
} Just execute the policy and learn from experience
} This is NOT offline planning! You actually take actions in the world.
Direct Evaluation

} Goal: compute values for each state under π

} Idea: average together observed sample values
  } Act according to π
  } Every time you visit a state, write down what the sum of discounted rewards turned out to be
  } Average those samples
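A sketch of direct evaluation under one possible data format (the format is an assumption for illustration: each trajectory is a list of (state, reward) pairs, where the reward is the one received on leaving that state while following π):

```python
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted returns from each visited state.

    episodes: list of trajectories, each a list of (state, reward) pairs
    collected while following the fixed policy pi.
    """
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is always the return-to-go from that state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The value estimate is the plain average of the sampled returns.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```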
Problems with Direct Evaluation

} What's good about it?
  } It's easy to understand
  } It doesn't require any knowledge of T, R
  } It eventually computes the correct average values, using just sample transitions

} What's bad about it?
  } It wastes information about state connections
  } Each state must be learned separately
  } So, it takes a long time to learn

Example: if B and E both go to C under this policy, how can their values be different?
Why Not Use Policy Evaluation?

} Simplified Bellman updates calculate V for a fixed policy:
  } Each round, replace V with a one-step-look-ahead layer over V:

    V_0^π(s) = 0
    V_{k+1}^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_k^π(s') ]

  } This approach fully exploits the connections between the states
  } Unfortunately, we need T and R to do it!

} Key question: how can we do this update to V without knowing T and R?
  } In other words, how do we take a weighted average without knowing the weights?
Sample-Based Policy Evaluation?

} We want to improve our estimate of V by computing these averages from samples:
  } Take samples of outcomes s' (by doing the action!) and average:

    sample_1 = R(s, π(s), s'_1) + γ V_k^π(s'_1)
    sample_2 = R(s, π(s), s'_2) + γ V_k^π(s'_2)
    …
    sample_n = R(s, π(s), s'_n) + γ V_k^π(s'_n)

    V_{k+1}^π(s) ← (1/n) Σ_i sample_i

} Almost! But we can't rewind time to get sample after sample from state s.
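For contrast, here is the exact update we cannot do model-free, written out as code; it needs the model explicitly (the dictionary formats for T, R, and pi are assumptions for illustration):

```python
def policy_evaluation(states, T, R, pi, gamma=0.9, iters=100):
    """Exact Bellman updates for a fixed policy -- requires the model.

    T[(s, a)] is a list of (s_next, prob) pairs; R[(s, a, s_next)] is the
    reward; terminal states are assumed absent from T and keep value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # The weighted average below is exactly what a model-free learner
        # cannot compute: the weights T and the rewards R are unknown.
        V = {s: sum(p * (R[(s, pi.get(s), s2)] + gamma * V[s2])
                    for s2, p in T.get((s, pi.get(s)), []))
             for s in states}
    return V
```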
Temporal Difference Learning

} Big idea: learn from every experience!
  } Update V(s) each time we experience a transition (s, a, s', r)
  } Likely outcomes s' will contribute updates more often

} Temporal difference learning of values
  } Policy still fixed, still doing evaluation!
  } Move values toward value of whatever successor occurs: running average

    sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
    update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
    same update:    V^π(s) ← V^π(s) + α (sample − V^π(s))
Exponential Moving Average

} The running interpolation update: x̄_n = (1-α) · x̄_{n-1} + α · x_n
} Makes recent samples more important: older samples are weighted down exponentially
} Forgets about the past (distant past values were wrong anyway)
} Decreasing the learning rate α over time can give converging averages
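A minimal TD-update sketch following the equations above (the dict-based V and the default hyperparameters are assumptions):

```python
def td_update(V, s, s2, r, alpha=0.1, gamma=0.9):
    """One temporal-difference update after observing transition (s, a, s', r)."""
    sample = r + gamma * V.get(s2, 0.0)                   # one-sample estimate of V(s)
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # running interpolation
```

This is called once per observed transition while following π; unlike the exact policy-evaluation update earlier, it never touches T or R.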
Detour: Q-Value Iteration

} Value iteration: find successive (depth-limited) values
  } Start with V_0(s) = 0, which we know is right
  } Given V_k, calculate the depth k+1 values for all states:

    V_{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_k(s') ]

} But Q-values are more useful, so compute them instead
  } Start with Q_0(s,a) = 0, which we know is right
  } Given Q_k, calculate the depth k+1 q-values for all q-states:

    Q_{k+1}(s,a) ← Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
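A sketch of q-value iteration for a *known* MDP, assuming the same dictionary model format as before and that every action is nominally available in every state:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Compute optimal q-values when T and R are known.

    T[(s, a)]: list of (s_next, prob) pairs; R[(s, a, s_next)]: reward.
    Unreachable or terminal q-states (absent from T) keep value 0.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        Q = {(s, a): sum(p * (R[(s, a, s2)]
                              + gamma * max(Q.get((s2, a2), 0.0)
                                            for a2 in actions))
                         for s2, p in T.get((s, a), []))
             for s in states for a in actions}
    return Q
```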
Active Reinforcement Learning

Full reinforcement learning: optimal policies
} You don't know the transitions T(s,a,s')
} You don't know the rewards R(s,a,s')
} You choose the actions now
} Goal: learn the optimal policy / values

In this case:
} Learner makes choices!
} Fundamental tradeoff: exploration vs. exploitation
} This is NOT offline planning! You actually take actions in the world and find out what happens…
Q-Learning

} We'd like to do Q-value updates to each Q-state:

    Q_{k+1}(s,a) ← Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]

  } But can't compute this update without knowing T, R

} Instead, compute average as we go
  } Receive a sample transition (s,a,r,s')
  } This sample suggests Q(s,a) ≈ r + γ max_{a'} Q(s',a')
  } But we want to average over results from (s,a) (Why? Because transitions are random, so a single sample is noisy)
  } So keep a running average

Q-Learning: learn Q(s,a) values as you go
} Receive a sample (s,a,s',r)
} Consider your old estimate: Q(s,a)
} Consider your new sample estimate: sample = R(s,a,s') + γ max_{a'} Q(s',a')
} Incorporate the new estimate into a running average:

    Q(s,a) ← (1-α) Q(s,a) + α · sample
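The update from the slide as a sketch (the dict-based Q, the fixed action set, and the terminal-state handling via a 0 default are assumptions):

```python
def q_learning_update(Q, actions, s, a, s2, r, alpha=0.1, gamma=0.9):
    """Fold one observed sample (s, a, s', r) into the running average."""
    # New sample estimate; unseen q-states default to 0.
    sample = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    # Running average of old estimate and new sample.
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```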
Q-Learning Properties

} Amazing result: Q-learning converges to the optimal policy -- even if you're acting suboptimally!
} This is called off-policy learning
} Caveats:
  } You have to explore enough
  } You have to eventually make the learning rate small enough
  } … but not decrease it too quickly
  } Basically, in the limit, it doesn't matter how you select actions (!)
Exploration: How to Act?

Simplest scheme: random actions (ε-greedy)
} Every time step, flip a coin
} With (small) probability ε, act randomly
} With (large) probability 1-ε, act on current policy

Problems with random actions?
} You do eventually explore the space, but keep thrashing around even once learning is done
} One solution: lower ε over time
} Another solution: exploration functions
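ε-greedy action selection as described above, as a minimal sketch (the dict-based Q and the actions list are assumptions):

```python
import random

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    """Flip a coin: with probability epsilon act randomly, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))    # exploit
```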
Exploration Functions

} When to explore?
  } Random actions: explore a fixed amount
  } Better idea: explore areas whose badness is not (yet) established, and eventually stop exploring
} Exploration function: takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n
} Modified Q-update: use f(Q(s',a'), N(s',a')) in place of Q(s',a') in the sample estimate, which also propagates the exploration "bonus" back to states that lead to unknown states
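One way this might look in code (the form f(u, n) = u + k/n follows the example above; the +1 guard for unvisited pairs is my own assumption):

```python
def exploration_f(u, n, k=1.0):
    """Optimistic utility: the fewer visits n, the bigger the bonus."""
    return u + k / (n + 1)   # +1 keeps unvisited pairs (n = 0) well-defined

# In the modified Q-update, the sample estimate becomes
#   r + gamma * max over a' of exploration_f(Q[(s2, a2)], N[(s2, a2)])
# so states that merely lead to unexplored states also look attractive.
```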
Generalizing Across States

} Basic Q-Learning keeps a table of all q-values
} In realistic situations, we cannot possibly learn about every single state!
  } Too many states to visit them all in training
  } Too many states to hold the q-tables in memory

} Instead, we want to generalize:
  } Learn about some small number of training states from experience
  } Generalize that experience to new, similar situations
  } This is a fundamental idea in machine learning, and we'll see it over and over again
[demo – RL pacman]
Feature-Based Representations

} Solution: describe a state using a vector of features (properties)
  } Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  } Example features:
    } Distance to closest ghost
    } Distance to closest dot
    } Number of ghosts
    } 1 / (dist to dot)²
    } Is Pacman in a tunnel? (0/1)
    } …… etc.
    } Is it the exact state on this slide?
  } Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Linear Value Functions

} Using a feature representation, we can write a q function (or value function) for any state using a few weights:

    V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s)
    Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)

} Advantage: our experience is summed up in a few powerful numbers
} Disadvantage: states may share features but actually be very different in value!
Approximate Q-Learning

} Q-learning with linear Q-functions:

    transition = (s, a, r, s')
    difference = [ r + γ max_{a'} Q(s',a') ] − Q(s,a)
    Exact Q's:       Q(s,a) ← Q(s,a) + α · difference
    Approximate Q's: w_i ← w_i + α · difference · f_i(s,a)

} Intuitive interpretation:
  } Adjust weights of active features
  } E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

} Formal justification: online least squares
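A sketch of the approximate update, with a hypothetical feature function features(s, a) returning a dict of feature values (all names and defaults here are illustrative assumptions):

```python
def approximate_q_update(w, features, actions, s, a, r, s2,
                         alpha=0.01, gamma=0.9):
    """Adjust the weights of the features that were active on (s, a).

    w: dict feature_name -> weight; features(s, a): dict feature_name -> value.
    """
    def Q(state, action):
        # Linear Q-function: dot product of weights and feature values.
        return sum(w.get(f, 0.0) * v
                   for f, v in features(state, action).items())

    difference = (r + gamma * max(Q(s2, a2) for a2 in actions)) - Q(s, a)
    for f, v in features(s, a).items():
        w[f] = w.get(f, 0.0) + alpha * difference * v   # blame active features
```

Note how a negative difference lowers the weight of every feature that was "on", which is exactly the dispreference described above.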
[Demo: approximate Q-learning pacman (L11D10)]
Conclusion

} We're done with Part I: Search and Planning!

} We've seen how AI methods can solve problems in:
  } Search
  } Constraint Satisfaction Problems
  } Games
  } Markov Decision Problems
  } Reinforcement Learning

} Next up: Part II: Reasoning, Uncertainty and Learning!