SLIDE 1 Welcome to DS595/CS525 Reinforcement Learning
Time: 6:00pm – 8:50pm, R (Thursday), Zoom lecture, Fall 2020. This lecture will be recorded!
SLIDE 2 Last Lecture
v What is reinforcement learning?
v Differences from other AI problems
v Application stories
v Topics to be covered in this course
v Course logistics
SLIDE 3 Reinforcement Learning: What is it?
Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize some notion of cumulative reward. (From Wikipedia)
- 1. Model
- 2. Value function
- 3. Policy
SLIDE 4 RL involves 4 key aspects
- 1. Optimization
- 2. Exploration
- 4. Delayed consequences
[Figure: a choice between two rewards, $5 and $20.]
v Programming all possibilities in advance is not feasible.
v The goal is to find an optimal way to make decisions that maximizes the total cumulative reward.
SLIDE 5 Branches of Machine Learning
[Venn diagram, from David Silver's slides: Machine Learning comprises Supervised Learning, Unsupervised Learning, and Reinforcement Learning; related areas include AI planning and imitation learning.]
SLIDE 6 Today’s topics
v Reinforcement Learning Components
§ Model, Value function, Policy
v Model-based Planning
§ Policy Evaluation, Policy Search
v Project 1 demo and description.
SLIDE 7 Today’s topics
v Reinforcement Learning Components
§ State vs. observation
§ Stochastic vs. deterministic model and policy
§ Model, Value function, Policy
v Model-based Planning
§ Policy Evaluation, Policy Search
v Project 1 demo and description.
SLIDE 8 Reinforcement Learning Components
[Diagram: the agent takes an Action in the Environment, which returns an Observation and a Reward.]
SLIDE 9 Agent-Environment interactions over time (a sequential decision process)
[Diagram: the agent takes action a_t; the environment emits observation o_t and reward r_t.]
Each time step t:
- 1. The agent takes an action a_t;
- 2. The world updates given action a_t and emits observation o_t and reward r_t;
- 3. The agent receives observation o_t and reward r_t.
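As an illustration (not the course's code), here is a minimal Python sketch of this loop; the Environment and Agent classes and their toy dynamics are assumptions made for the example:

import random

# Minimal agent-environment interaction loop (toy dynamics, for illustration).
class Environment:
    def step(self, action):
        observation = random.randint(0, 5)              # o_t
        reward = 1.0 if action == "right" else 0.0      # r_t
        return observation, reward

class Agent:
    def act(self, history):
        return random.choice(["left", "right"])         # toy policy: ignores history

env, agent = Environment(), Agent()
history = []                                            # h_t = (a_1, o_1, r_1, ...)
for t in range(10):
    action = agent.act(history)                         # 1. agent takes action a_t
    observation, reward = env.step(action)              # 2. world emits o_t and r_t
    history.append((action, observation, reward))       # 3. agent receives o_t, r_t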
SLIDE 10 Interaction history & decision-making
[Diagram: the agent and environment exchange action a_t, observation o_t, and reward r_t.]
History: h_t = (a_1, o_1, r_1, ..., a_t, o_t, r_t).
The agent chooses action a_{t+1} based on the history h_t.
State: the information assumed to determine what happens next, as a function of the history: s_t = f(h_t). In many cases, for simplicity, s_t = o_t.
SLIDE 11 State transition & the Markov property
With observation/state s_t = o_t and action a_t, the transition probability is p(s_{t+1} | s_t, a_t).
A state s_t is Markov if and only if: p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t).
The future is independent of the past, given the present.
SLIDE 12
[Figure: a map showing three candidate routes, Path 1, Path 2, and Path 3.]
A taxi driver seeks passengers. State (observation): (current location, with or without a passenger). Action: a direction to go.
Hypertension control. State: (current blood pressure). Action: take medication or not.
SLIDE 13 More on the Markov property
- 1. Does the Markov property always hold?
§ No.
- 2. What if the Markov property does not hold?
SLIDE 14 More on the Markov property
- 1. Does the Markov property always hold?
§ No.
- 2. What if the Markov property does not hold?
§ Make it Markov by setting the state to be the full history: s_t = h_t.
Again, in practice, we often assume the most recent observation is a sufficient statistic of the history: s_t = o_t.
The state representation has big implications for:
- 1. Computational complexity
- 2. Data required
- 3. Resulting performance
SLIDE 15 Fully vs. Partially Observable Markov Decision Processes
Fully observable: what you observe fully represents the environment state, so s_t = o_t.
Partially observable: what you observe only partially represents the environment state, so use s_t = h_t.
SLIDE 16
[Examples: the Breakout game (fully observable) vs. poker games (partially observable).]
SLIDE 17 Deterministic vs. Stochastic Model
Deterministic: given the history and action, there is a single possible next state and reward. A common assumption in robotics and controls.
§ p(s_{t+1} | s_t, a_t) = 1 for s_{t+1} = s', and p(s_{t+1} | s_t, a_t) = 0 for s_{t+1} ≠ s'
§ r(s_t, a_t) = 3 for s_t = s, a_t = a
Stochastic: given the history and action, there are many potential next states and rewards. A common assumption for customers, patients, and hard-to-model domains.
§ 0 ≤ p(s_{t+1} | s_t, a_t) < 1
§ P[r(s_t, a_t) = 3] = 50%, P[r(s_t, a_t) = 5] = 50% for s_t = s, a_t = a
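A small Python illustration of the two cases, using the toy numbers on this slide (the transition rules are assumptions made for the example):

import random

def deterministic_reward(s, a):
    return 3.0                        # r(s_t, a_t) = 3, always the same

def stochastic_reward(s, a):
    # P[r = 3] = 50%, P[r = 5] = 50%
    return 3.0 if random.random() < 0.5 else 5.0

def deterministic_step(s, a):
    return s + 1                      # exactly one successor: p(s' | s, a) = 1

def stochastic_step(s, a):
    return random.choice([s, s + 1])  # two successors, probability 0.5 each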
SLIDE 18
[Examples: the Breakout game (deterministic) vs. hypertension control (stochastic), for both the transition and the reward models.]
SLIDE 19 Example: the taxi passenger-seeking task as a decision-making process
States: locations of the taxi (s_1, ..., s_6) on the road
Actions: Left or Right
Rewards: +1 in state s_1, +3 in state s_5, 0 in all other states
[Figure: a road of six cells, s_1 through s_6.]
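This task can be encoded directly; a sketch (indices 0-5 stand for s_1-s_6, and the one-cell-per-step dynamics with walls at both ends is an assumption based on the figure):

# Taxi passenger-seeking task from the slide.
STATES = list(range(6))         # s_1..s_6 encoded as 0..5
ACTIONS = ["left", "right"]
REWARD = [1, 0, 0, 0, 3, 0]     # +1 in s_1, +3 in s_5, 0 elsewhere

def step(s, a):
    # Deterministic dynamics: move one cell, staying put at the road ends.
    return max(s - 1, 0) if a == "left" else min(s + 1, 5)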
SLIDE 20 RL components
v Often include one or more of:
§ Model: a representation of how the world changes in response to the agent's actions
§ Policy: a function mapping the agent's states to actions
§ Value function: the future rewards from being in a state and/or taking an action when following a particular policy
SLIDE 22 RL components: Model
v The agent's representation of how the world changes in response to the agent's actions, with two parts:
§ Transition model: predicts the next agent state, p(s_{t+1} = s' | s_t = s, a_t = a)
§ Reward model: predicts the immediate reward, r(s_t = s, a_t = a)
SLIDE 23 Taxi passenger-seeking task: stochastic Markov model
Taxi agent's transition model: p(s_3 | s_3, right) = p(s_4 | s_3, right) = 0.5 and p(s_4 | s_4, right) = p(s_5 | s_4, right) = 0.5.
The agent's reward model is r'_1 = r'_2 = ... = r'_6 = 0, which may be wrong: the true reward model is r = [1, 0, 0, 0, 3, 0].
[Figure: a road of six cells, s_1 through s_6, annotated with the agent's predicted rewards r'_i = 0.]
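The agent's model on this slide could be stored as follows (a sketch; only the "right" probabilities are given on the slide, so the "left" case is an assumed mirror image):

# Agent's transition model: with "right", stay or advance with probability 0.5 each.
def model_next_states(s, a):
    if a == "right":
        return {s: 0.5, min(s + 1, 5): 0.5}
    return {s: 0.5, max(s - 1, 0): 0.5}   # "left" assumed symmetric

# Agent's reward model predicts 0 everywhere -- it disagrees with the true
# rewards r = [1, 0, 0, 0, 3, 0], which is exactly the slide's point.
MODEL_REWARD = [0, 0, 0, 0, 0, 0]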
SLIDE 25 RL components: Policy
v A policy π determines how the agent chooses actions
§ π : S → A, a mapping from states to actions
v Deterministic policy:
§ π(s) = a
§ In other words, π(a|s) = 1 while π(a'|s) = π(a''|s) = 0 for the other actions
v Stochastic policy:
§ π(a|s) = Pr(a_t = a | s_t = s)
[Figure: a state s with candidate actions a, a', a'' under each policy type.]
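In code, the two policy types might look like this (a sketch using the taxi's two actions):

import random

def deterministic_policy(s):
    return "right"                          # pi(s) = a: pi(a|s) = 1, others 0

def stochastic_policy(s):
    # pi(a|s) = Pr(a_t = a | s_t = s); here a 50/50 mix over both actions.
    return random.choice(["left", "right"])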
SLIDE 26 Taxi passenger-seeking task: Policy
Action set: {left, right}. The policy is indicated by the arrows in the figure.
Q1: Is this a deterministic or a stochastic policy?
Q2: Give an example of the other policy type.
[Figure: a road of six cells, s_1 through s_6, with the action arrows marked 50%/50%.]
SLIDE 28 RL components: Value Function
v Value function V^π: the expected discounted sum of future rewards under a particular policy π:
§ V^π(s) = E_π[r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s_t = s]
v The discount factor γ weighs immediate vs. future rewards, with γ in [0, 1].
v Can be used to quantify the goodness/badness of states and actions,
v and to decide how to act by comparing policies.
[Figure: a state s with candidate actions a and a'.]
SLIDE 29 Taxi passenger-seeking task: Value function
Discount factor γ = 0.
Policy #1: π(s_1) = π(s_2) = ··· = π(s_6) = right. Q: What is V^π?
Policy #2: π(left|s_i) = π(right|s_i) = 50% for i = 1, ..., 6. Q: What is V^π?
[Figure: a road of six cells, s_1 through s_6.]
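A quick check of the two questions, assuming (as the reward vector r = [1, 0, 0, 0, 3, 0] suggests) that the reward depends only on the current state: with γ = 0, V^π(s) = E[r_t | s_t = s], so both policies have the same value function.

REWARD = [1, 0, 0, 0, 3, 0]

def value_gamma_zero(s):
    # With gamma = 0 the value is the expected immediate reward; since the
    # reward here does not depend on the action, policies #1 and #2 agree.
    return REWARD[s]

print([value_gamma_zero(s) for s in range(6)])   # [1, 0, 0, 0, 3, 0]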
SLIDE 30 Types of RL agents/algorithms
v Model-based: has an explicit model; may or may not have a policy and/or value function.
v Model-free: has an explicit value function and/or policy function, but no model.
SLIDE 31 Today's topics
v Reinforcement Learning Components
§ Model, Value function, Policy
v Model-based Planning
§ MDP model
§ Policy Evaluation, Policy Search
v Project 1 demo and description.
SLIDE 32 MDP
v Markov Decision Process
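For reference, an MDP is commonly specified by a tuple (S, A, P, R, γ): S is the set of states, A the set of actions, P(s' | s, a) the transition model, R(s, a) the reward model, and γ ∈ [0, 1] the discount factor.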
SLIDE 33 MDP notation: Transition Model, Reward Model, Policy function, Value function
SLIDE 34 Taxi passenger-seeking task: MDP
[Figure: a road of six cells, s_1 through s_6, with actions a_1 and a_2 and a deterministic transition model.]
SLIDE 37 Taxi passenger-seeking task: MDP Policy Evaluation
[Figure: a road of six cells, s_1 through s_6, with actions a_1 and a_2.]
v Let π(s) = a_1 ∀s, with γ = 0.
v What is the value of this policy?
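A policy-evaluation sketch for this question (taking a_1 to be "left" is an assumption; with γ = 0 the answer is the same either way, namely the immediate rewards):

REWARD = [1, 0, 0, 0, 3, 0]

def step(s, a):
    return max(s - 1, 0) if a == "left" else min(s + 1, 5)

def evaluate(policy, gamma=0.0, iters=100):
    # Iterative policy evaluation: V(s) <- r(s) + gamma * V(next state).
    V = [0.0] * 6
    for _ in range(iters):
        V = [REWARD[s] + gamma * V[step(s, policy[s])] for s in range(6)]
    return V

print(evaluate(["left"] * 6))   # gamma = 0  =>  [1.0, 0.0, 0.0, 0.0, 3.0, 0.0]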
SLIDE 38
SLIDE 39 Taxi passenger-seeking task: MDP Control
[Figure: a road of six cells, s_1 through s_6, with actions a_1 and a_2.]
v 6 discrete states (the location of the taxi)
v 2 actions: Left or Right
v How many deterministic policies are there?
v Is the optimal policy for an MDP always unique?
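A quick worked count: a deterministic policy assigns one of the 2 actions to each of the 6 states independently, so there are 2^6 = 64 deterministic policies. The optimal policy need not be unique: if two actions from some state achieve the same optimal value, both choices yield an optimal policy (though the optimal value function is unique).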
SLIDE 40
SLIDE 41
SLIDE 42
SLIDE 43
SLIDE 44
SLIDE 45
SLIDE 46 If the policy doesn't change, can it ever change again? Is there a maximum number of iterations of policy iteration?
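A minimal policy-iteration sketch for the taxi MDP (an illustration, not the slides' code; γ = 0.9 is an assumed value). It also suggests the answers: once the greedy step leaves the policy unchanged, the policy is a fixed point and never changes again; and since every change strictly improves the policy and there are only finitely many deterministic policies (64 here), the number of iterations is bounded.

REWARD = [1, 0, 0, 0, 3, 0]
GAMMA = 0.9
ACTIONS = ["left", "right"]

def step(s, a):
    return max(s - 1, 0) if a == "left" else min(s + 1, 5)

def evaluate(policy, iters=500):
    # Iterative policy evaluation, as in the earlier sketch.
    V = [0.0] * 6
    for _ in range(iters):
        V = [REWARD[s] + GAMMA * V[step(s, policy[s])] for s in range(6)]
    return V

def policy_iteration():
    policy = ["left"] * 6
    while True:
        V = evaluate(policy)
        # Greedy improvement: pick the action leading to the best successor.
        improved = [max(ACTIONS, key=lambda a, s=s: REWARD[s] + GAMMA * V[step(s, a)])
                    for s in range(6)]
        if improved == policy:          # policy unchanged => converged for good
            return policy, V
        policy = improved

print(policy_iteration()[0])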
SLIDE 47
SLIDE 48
SLIDE 49
SLIDE 50
SLIDE 51 Project 1
Project 1 starts today. Due 9/24 at midnight.
v https://users.wpi.edu/~yli15/courses/DS595CS525Fall20/Assignments.html
SLIDE 52
Any Comments & Critiques?