Module 11: Introduction to Reinforcement Learning
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Machine Learning
- Supervised Learning
– Teacher tells learner what to remember
- Reinforcement Learning
– Environment provides hints to learner
- Unsupervised Learning
– Learner discovers on its own
Animal Psychology
- Negative reinforcements:
– Pain and hunger
- Positive reinforcements:
– Pleasure and food
- Reinforcements used to train animals
- Let’s do the same with computers!
RL Examples
- Game playing (backgammon, solitaire)
- Operations research (pricing, vehicle routing)
- Elevator scheduling
- Helicopter control
- Spoken dialog systems
Reinforcement Learning
- Definition:
– Markov decision process with unknown transition and reward models
- Set of states S
- Set of actions A
– Actions may be stochastic
- Set of reinforcement signals (rewards)
– Rewards may be delayed
Policy optimization
- Markov Decision Process:
– Find the optimal policy given the transition and reward model
– Execute the policy found
- Reinforcement learning:
– Learn an optimal policy while interacting with the environment
Reinforcement Learning Problem
[Figure: agent–environment interaction loop; the agent observes states s_0, s_1, s_2, …, receives rewards r_0, r_1, r_2, …, and chooses actions a_0, a_1, a_2, …]
Goal: Learn to choose actions that maximize r_0 + γr_1 + γ²r_2 + γ³r_3 + ⋯
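A quick numeric check of the discounted objective above; the reward sequence is made up purely for illustration.

```python
# Discounted return r_0 + γ r_1 + γ² r_2 + ... for a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical r_0..r_3
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```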
Example: Inverted Pendulum
- State: x(t), x′(t), θ(t), θ′(t)
- Action: force F
- Reward: 1 for any step where the pole is balanced
- Problem: find π: S → A that maximizes rewards
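To make the state/action/reward interface concrete, here is a minimal environment sketch. The dynamics are deliberately simplified (pole angle only, frictionless, Euler integration) and every name is a hypothetical choice for illustration, not the actual benchmark dynamics.

```python
import math

class SimplePendulum:
    """Pole balancing with simplified, frictionless dynamics (an assumption)."""

    def __init__(self, g=9.8, length=1.0, mass=1.0, dt=0.02):
        self.g, self.l, self.m, self.dt = g, length, mass, dt
        self.theta, self.theta_dot = 0.05, 0.0  # start near upright

    def step(self, force):
        # theta'' = (g/l) sin(theta) + force/(m l^2), integrated with Euler steps
        theta_ddot = (self.g / self.l) * math.sin(self.theta) \
                     + force / (self.m * self.l ** 2)
        self.theta_dot += theta_ddot * self.dt
        self.theta += self.theta_dot * self.dt
        reward = 1.0 if abs(self.theta) < 0.5 else 0.0  # +1 while balanced
        return (self.theta, self.theta_dot), reward
```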
Types of RL
- Passive vs active learning
– Passive learning: the agent executes a fixed policy and tries to evaluate it
– Active learning: the agent updates its policy as it learns
- Model-based vs model-free
– Model-based: learn the transition and reward model and use it to determine the optimal policy
– Model-free: derive the optimal policy without learning the model
Passive Learning
- Transition and reward model known:
– Evaluate π (a solver sketch follows below):
V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)   ∀s
- Transition and reward model unknown:
– Estimate the value of the policy as the agent executes it:
V^π(s) = E_π[ Σ_t γ^t R(s_t, π(s_t)) ]
– Model-based vs model-free
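A minimal sketch of exact policy evaluation for the known-model case, solving the linear system above directly. The array layout (P for the fixed policy's transition matrix, R for its reward vector) is an assumption for illustration.

```python
import numpy as np

def evaluate_policy(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """Solve V = R + gamma * P @ V, i.e. (I - gamma * P) V = R.

    P[s, s'] = Pr(s'|s, pi(s)) under the fixed policy; R[s] = R(s, pi(s)).
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```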
Passive learning
[Figure: 4×3 grid world with a fixed policy shown as arrows (left, up, right); terminal states +1 at (4,3) and −1 at (4,2)]
γ = 1; reward is −0.04 for non-terminal states; the transition probabilities are unknown.
Observed trajectories:
(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3)+1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3)+1
(1,1) (2,1) (3,1) (3,2) (4,2)−1
What is the value V(s) of being in state s?
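One way to answer this without a model is direct (Monte Carlo) estimation: average the observed returns from each visit to a state. A minimal sketch over the three trajectories above, assuming γ = 1 and −0.04 per non-terminal step; the names are hypothetical.

```python
from collections import defaultdict

def mc_estimate(trajectories, step_reward=-0.04):
    """Every-visit Monte Carlo value estimate with gamma = 1."""
    returns = defaultdict(list)
    for states, terminal_reward in trajectories:
        for i, s in enumerate(states):
            steps_left = len(states) - 1 - i          # non-terminal steps remaining
            g = steps_left * step_reward + terminal_reward
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

trajs = [
    ([(1,1),(1,2),(1,3),(1,2),(1,3),(2,3),(3,3),(4,3)], +1),
    ([(1,1),(1,2),(1,3),(2,3),(3,3),(3,2),(3,3),(4,3)], +1),
    ([(1,1),(2,1),(3,1),(3,2),(4,2)], -1),
]
print(mc_estimate(trajs))
```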
Passive ADP
- Adaptive dynamic programming (ADP)
– Model-based
– Learn transition probabilities and rewards from observations
– Then update the values of the states
ADP Example
[Figure: the same 4×3 grid world and fixed policy as before]
γ = 1; reward is −0.04 for non-terminal states.
Observed trajectories:
(1,1) (1,2) (1,3) (1,2) (1,3) (2,3) (3,3) (4,3)+1
(1,1) (1,2) (1,3) (2,3) (3,3) (3,2) (3,3) (4,3)+1
(1,1) (2,1) (3,1) (3,2) (4,2)−1
From the counts: P((2,3)|(1,3), r) = 2/3 and P((1,2)|(1,3), r) = 1/3.
We need to learn all the transition probabilities! Use this information in
V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)
Passive ADP
PassiveADP(π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: N(s) ← N(s) + 1,  N(s, s′) ← N(s, s′) + 1
    Update transition: Pr(s′|s, π(s)) ← N(s, s′) / N(s)   ∀s′
    Update reward: R(s, π(s)) ← [r + (N(s) − 1) R(s, π(s))] / N(s)
    Solve: V^π(s) = R(s, π(s)) + γ Σ_{s′} Pr(s′|s, π(s)) V^π(s′)   ∀s
    s ← s′
  Until convergence of V^π
  Return V^π
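A minimal executable sketch of passive ADP, assuming tabular states, a fixed policy pi (an array mapping state to action), and a hypothetical env.step(action) -> (next_state, reward) interface.

```python
import numpy as np

def passive_adp(env, pi, n_states, s, gamma, steps=10_000):
    N = np.zeros(n_states)                 # visit counts N(s)
    Nss = np.zeros((n_states, n_states))   # transition counts N(s, s')
    R = np.zeros(n_states)                 # running-average reward R(s, pi(s))
    for _ in range(steps):
        s2, r = env.step(pi[s])            # execute pi(s), observe s' and r
        N[s] += 1
        Nss[s, s2] += 1
        R[s] += (r - R[s]) / N[s]          # incremental mean, as in the slide's update
        s = s2
    P = Nss / np.maximum(N[:, None], 1)    # maximum-likelihood transition model
    V = np.linalg.solve(np.eye(n_states) - gamma * P, R)
    return V
```

The slide re-solves for V^π after every step; solving once at the end (or periodically) is a common shortcut that does not change the estimated model.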
Passive TD
- Temporal difference (TD)
– Model-free
- At each time step
– Observe s, a, s′, r
– Update V^π(s) after each move:
V^π(s) ← V^π(s) + α (R(s, π(s)) + γ V^π(s′) − V^π(s))
where α is the learning rate and the bracketed term is the temporal difference
TD Convergence
Theorem: If α is decreased appropriately with the number of times a state is visited, then V^π(s) converges to the correct value.
- α must satisfy:
– Σ_t α_t = ∞
– Σ_t α_t² < ∞
- Often α(s) = 1/n(s)
– where n(s) = # of times s is visited
Passive TD
PassiveTD(π, V^π)
  Repeat
    Execute π(s); observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update value: V^π(s) ← V^π(s) + α (r + γ V^π(s′) − V^π(s))
    s ← s′
  Until convergence of V^π
  Return V^π
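A minimal sketch of passive TD(0) under the same hypothetical tabular/env interface as the ADP sketch above.

```python
import numpy as np

def passive_td(env, pi, n_states, s, gamma, steps=100_000):
    V = np.zeros(n_states)
    n = np.zeros(n_states)
    for _ in range(steps):
        s2, r = env.step(pi[s])
        n[s] += 1
        alpha = 1.0 / n[s]                          # meets the convergence conditions
        V[s] += alpha * (r + gamma * V[s2] - V[s])  # temporal-difference update
        s = s2
    return V
```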
Comparison
- Model-free approach:
– Less computation per time step
- Model-based approach:
– Fewer time steps before convergence
Active Learning
- Ultimately, we are interested in improving π
- Transition and reward model known (a value iteration sketch follows below):
V*(s) = max_a [ R(s, a) + γ Σ_{s′} Pr(s′|s, a) V*(s′) ]
- Transition and reward model unknown:
– Improve the policy as the agent executes it
– Model-based vs model-free
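For the known-model case, the fixed point above can be computed by value iteration. A minimal sketch; the array layout (P[a] as a per-action transition matrix, R as a state-by-action reward table) is an assumption for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [R(s,a) + gamma * sum_s' P[a][s,s'] V(s')]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```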
Q-learning (aka active temporal difference)
- Q-function: Q: S × A → ℝ
– Value of a state-action pair
– The policy π(s) = argmax_a Q(s, a) is the greedy policy w.r.t. Q
- Bellman's equation:
Q*(s, a) = R(s, a) + γ Σ_{s′} Pr(s′|s, a) max_{a′} Q*(s′, a′)
Q-Learning
Qlearning(s, Q*)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: n(s) ← n(s) + 1
    Learning rate: α ← 1/n(s)
    Update Q-value: Q*(s, a) ← Q*(s, a) + α (r + γ max_{a′} Q*(s′, a′) − Q*(s, a))
    s ← s′
  Until convergence of Q*
  Return Q*
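A minimal tabular Q-learning sketch. Action selection is ε-greedy here (exploration is discussed a few slides below); the env interface and all names are hypothetical.

```python
import numpy as np

def q_learning(env, n_states, n_actions, s, gamma, eps=0.1, steps=100_000):
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    n = np.zeros(n_states)
    for _ in range(steps):
        # epsilon-greedy: random action with probability eps, else greedy
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r = env.step(a)
        n[s] += 1
        alpha = 1.0 / n[s]
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])  # target uses max over a'
        s = s2
    return Q
```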
Q-learning example
[Figure: before/after snapshots of a grid fragment; the arrow from s1 to s2 has Q(s1, right) = 73 before the update, the actions out of s2 have Q-values 100, 66, and 81, and after the update Q(s1, right) = 81.5]
γ = 0.9, α = 0.5, r = 0 for non-terminal states
Q(s1, right) ← Q(s1, right) + α (r + γ max_{a′} Q(s2, a′) − Q(s1, right))
= 73 + 0.5 (0 + 0.9 max{66, 81, 100} − 73)
= 73 + 0.5 (17) = 81.5
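A one-line check of the arithmetic above.

```python
q = 73 + 0.5 * (0 + 0.9 * max(66, 81, 100) - 73)
print(q)  # 81.5
```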
Exploration vs Exploitation
- If an agent always chooses the action with the highest value, then it is exploiting
– The learned model is not the real model
– This leads to suboptimal results
- By taking random actions (pure exploration), an agent may learn the model
– But what is the use of learning a complete model if parts of it are never used?
- A balance between exploitation and exploration is needed
Common exploration methods
- ε-greedy:
– With probability ε, execute a random action
– Otherwise, execute the best action a* = argmax_a Q(s, a)
- Boltzmann exploration:
Pr(a) = e^{Q(s,a)/T} / Σ_{a} e^{Q(s,a)/T}
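A minimal sketch of both rules; Q is a 1-D array of Q-values for the current state and T is the Boltzmann temperature (both hypothetical names).

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, eps):
    if rng.random() < eps:
        return int(rng.integers(len(Q)))  # explore: uniformly random action
    return int(Q.argmax())                # exploit: greedy action

def boltzmann(Q, T):
    z = np.exp((Q - Q.max()) / T)         # subtract the max for numerical stability
    return int(rng.choice(len(Q), p=z / z.sum()))
```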
Exploration and Q-learning
- Q-learning converges to optimal Q-values if
– Every state is visited infinitely often (due to exploration)
– The action selection becomes greedy as time approaches infinity
– The learning rate α is decreased fast enough, but not too fast (the conditions on Σ_t α_t above)
Model-based Active RL
- Idea: at each step
– Execute an action
– Observe the resulting state and reward
– Update the model
– Update the policy π
Model-based Active RL
ModelBasedActiveRL(s)
  Repeat
    Select and execute a; observe s′ and r
    Update counts: N(s, a) ← N(s, a) + 1,  N(s, a, s′) ← N(s, a, s′) + 1
    Update transition: Pr(s′|s, a) ← N(s, a, s′) / N(s, a)   ∀s′
    Update reward: R(s, a) ← [r + (N(s, a) − 1) R(s, a)] / N(s, a)
    Solve: V*(s) = max_a [ R(s, a) + γ Σ_{s′} Pr(s′|s, a) V*(s′) ]   ∀s
    s ← s′
  Until convergence of V*
  Return V*
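A minimal sketch tying the pieces together, reusing the hypothetical value_iteration sketch from the active-learning slide. Re-solving after every step, as the slide does, is expensive; real implementations typically re-plan less often.

```python
import numpy as np

def model_based_active_rl(env, n_states, n_actions, s, gamma, eps=0.1, steps=5_000):
    rng = np.random.default_rng(0)
    Nsa = np.zeros((n_states, n_actions))
    Nsas = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    policy = np.zeros(n_states, dtype=int)
    for _ in range(steps):
        # epsilon-greedy around the current planned policy, to keep exploring
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(policy[s])
        s2, r = env.step(a)
        Nsa[s, a] += 1
        Nsas[s, a, s2] += 1
        R[s, a] += (r - R[s, a]) / Nsa[s, a]        # running-average reward
        P = Nsas / np.maximum(Nsa[:, :, None], 1)   # maximum-likelihood transitions
        _, policy = value_iteration([P[:, a2, :] for a2 in range(n_actions)], R, gamma)
        s = s2
    return policy
```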
Summary
- We can optimize a policy by RL when the transition and reward functions are unknown
- Comparison:
– Model-free: computationally cheaper
– Model-based: faster convergence
- Active learning:
– Requires balancing exploration and exploitation