

  1. Lecture 6: CNNs and Deep Q Learning. Emma Brunskill, CS234 Reinforcement Learning, Winter 2020. With many slides for DQN from David Silver and Ruslan Salakhutdinov, some vision slides from Gianni Di Caro, and images from Stanford CS231n, http://cs231n.github.io/convolutional-networks/

  2. Refresh Your Knowledge 5. In TD learning with linear VFA (select all that apply): (1) $w \leftarrow w + \alpha \left( r(s_t) + \gamma\, x(s_{t+1})^\top w - x(s_t)^\top w \right) x(s_t)$; (2) $V(s) = w(s)\, x(s)$; (3) asymptotic convergence to the true best minimum-MSE linear-representable $V(s)$ is guaranteed for $\alpha \in (0, 1)$, $\gamma < 1$; (4) not sure.
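
For concreteness, here is a minimal Python sketch of the TD(0) update with a linear VFA from option (1); the function and variable names are illustrative, not from the course materials:

```python
import numpy as np

def td0_linear_update(w, x_s, x_s_next, reward, alpha, gamma):
    """One TD(0) update for a linear VFA, where V(s) = x(s)^T w."""
    td_target = reward + gamma * (x_s_next @ w)   # bootstrapped target r + gamma * V(s_{t+1})
    td_error = td_target - (x_s @ w)              # TD error delta_t
    return w + alpha * td_error * x_s             # w <- w + alpha * delta_t * x(s_t)

# Tiny usage example with 3-dimensional features (values are made up)
w = np.zeros(3)
w = td0_linear_update(w, x_s=np.array([1.0, 0.0, 1.0]),
                      x_s_next=np.array([0.0, 1.0, 1.0]),
                      reward=1.0, alpha=0.1, gamma=0.9)
```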

  3. Class Structure. Last time: value function approximation. This time: RL with function approximation, deep RL.

  4. Table of Contents: 1. Control using Value Function Approximation; 2. Convolutional Neural Nets (CNNs); 3. Deep Q Learning

  5. Control using Value Function Approximation. Use value function approximation to represent state-action values: $\hat{Q}^\pi(s, a; w) \approx Q^\pi$. Interleave approximate policy evaluation (using value function approximation) with $\epsilon$-greedy policy improvement. This can be unstable; instability generally involves the intersection of function approximation, bootstrapping, and off-policy learning.
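
As a minimal sketch, the $\epsilon$-greedy improvement step can be implemented as below, assuming a vector of approximate action values $\hat{Q}(s, \cdot; w)$ has already been computed (the names are illustrative, not course code):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit: argmax_a Q_hat(s, a; w)

# Usage: q_values holds Q_hat(s, a; w) for each action in the current state s
rng = np.random.default_rng(0)
q_values = np.array([0.2, 0.5, 0.1])
action = epsilon_greedy_action(q_values, epsilon=0.1, rng=rng)
```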

  6. Linear State Action Value Function Approximation. Use features to represent both the state and action: $x(s, a) = \left( x_1(s, a),\ x_2(s, a),\ \ldots,\ x_n(s, a) \right)^\top$. Represent the state-action value function with a weighted linear combination of features: $\hat{Q}(s, a; w) = x(s, a)^\top w = \sum_{j=1}^{n} x_j(s, a)\, w_j$. The gradient descent update uses $\nabla_w J(w) = \nabla_w \mathbb{E}_\pi \left[ \left( Q^\pi(s, a) - \hat{Q}^\pi(s, a; w) \right)^2 \right]$.
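
A minimal sketch of a linear state-action value function in Python; the block-one-hot construction of $x(s, a)$ here is an illustrative assumption, one of many possible feature choices:

```python
import numpy as np

def state_action_features(state_features, action, num_actions):
    """One possible x(s, a): copy the state features into the block for the chosen action."""
    n = len(state_features)
    x = np.zeros(n * num_actions)
    x[action * n:(action + 1) * n] = state_features
    return x

def q_hat(state_features, action, w, num_actions):
    """Q_hat(s, a; w) = x(s, a)^T w."""
    return state_action_features(state_features, action, num_actions) @ w

# Usage: 4 state features and 2 actions -> 8 weights
w = np.zeros(4 * 2)
value = q_hat(np.array([1.0, 0.0, 0.5, 0.2]), action=1, w=w, num_actions=2)
```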

  7. Incremental Model-Free Control Approaches with Linear VFA. As in policy evaluation, the true state-action value function for a state is unknown, so we substitute a target value. In Monte Carlo methods, use the return $G_t$ as a substitute target: $\Delta w = \alpha \left( G_t - \hat{Q}(s_t, a_t; w) \right) \nabla_w \hat{Q}(s_t, a_t; w)$. For SARSA, instead use a TD target $r + \gamma \hat{Q}(s', a'; w)$, which leverages the current function approximation value: $\Delta w = \alpha \left( r + \gamma \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w) = \alpha \left( r + \gamma\, x(s', a')^\top w - x(s, a)^\top w \right) x(s, a)$.
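
A minimal sketch of this SARSA update with a linear VFA, assuming a feature function that returns $x(s, a)$ as a NumPy vector (names are illustrative):

```python
import numpy as np

def sarsa_linear_update(w, x_sa, x_sa_next, reward, alpha, gamma, done=False):
    """One SARSA step: w <- w + alpha * (r + gamma * Q(s',a';w) - Q(s,a;w)) * x(s,a)."""
    q_sa = x_sa @ w
    q_next = 0.0 if done else x_sa_next @ w       # bootstrap from the sampled next pair (s', a')
    td_error = reward + gamma * q_next - q_sa
    return w + alpha * td_error * x_sa
```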

  8. Incremental Model-Free Control Approaches. As in policy evaluation, the true state-action value function for a state is unknown, so we substitute a target value. In Monte Carlo methods, use the return $G_t$ as a substitute target: $\Delta w = \alpha \left( G_t - \hat{Q}(s_t, a_t; w) \right) \nabla_w \hat{Q}(s_t, a_t; w)$. For SARSA, instead use a TD target $r + \gamma \hat{Q}(s', a'; w)$, which leverages the current function approximation value: $\Delta w = \alpha \left( r + \gamma \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w) = \alpha \left( r + \gamma\, x(s', a')^\top w - x(s, a)^\top w \right) x(s, a)$. For Q-learning, instead use a TD target $r + \gamma \max_{a'} \hat{Q}(s', a'; w)$, which leverages the max of the current function approximation value: $\Delta w = \alpha \left( r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right) \nabla_w \hat{Q}(s, a; w) = \alpha \left( r + \gamma \max_{a'} x(s', a')^\top w - x(s, a)^\top w \right) x(s, a)$.
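
The corresponding Q-learning update replaces the sampled next action with a max over actions. A minimal sketch under the same assumptions as the SARSA sketch above (the feature function and finite action set are illustrative):

```python
import numpy as np

def q_learning_linear_update(w, x, s, a, reward, s_next, actions, alpha, gamma, done=False):
    """One Q-learning step: the target uses max_{a'} x(s', a')^T w rather than the sampled a'."""
    x_sa = x(s, a)                                # feature vector for the taken (s, a)
    q_next = 0.0 if done else max(x(s_next, a_next) @ w for a_next in actions)
    td_error = reward + gamma * q_next - (x_sa @ w)
    return w + alpha * td_error * x_sa
```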

  9. Convergence of TD Methods with VFA. Informally, each update involves doing an (approximate) Bellman backup followed by trying to best fit the underlying value function with a particular feature representation. Bellman operators are contractions, but the value function approximation fitting step can be an expansion.

  10. Challenges of Off-Policy Control: Baird Example. The behavior policy and target policy are not identical, and the value estimates can diverge.

  11. Convergence of Control Methods with VFA. (Table comparing Monte-Carlo Control, Sarsa, and Q-learning under tabular, linear VFA, and nonlinear VFA representations.) This is an active area of research. An important issue is not just whether the algorithm converges, but what solution it converges to. Critical choices: the objective function and the feature representation. Chapter 11 of Sutton and Barto has a good discussion of some efforts in this direction.

  12. Linear Value Function Approximation. (Figure from Sutton and Barto 2018.)

  13. What You Should Understand. Be able to implement TD(0) and MC on-policy evaluation with linear value function approximation. Be able to define what TD(0) and MC on-policy evaluation with linear VFA converge to, and when that solution has zero error versus non-zero error. Be able to implement the Q-learning, SARSA, and MC control algorithms. List the three issues that can cause instability and describe them qualitatively: function approximation, bootstrapping, and off-policy learning.

  14. Class Structure. Last time and start of this time: control (making decisions) without a model of how the world works. Rest of today: deep reinforcement learning. Next time: deep RL continued.

  15. RL with Function Approximation. Linear value function approximators assume the value function is a weighted combination of a set of features, where each feature is a function of the state. Linear VFAs often work well given the right set of features, but this can require carefully hand-designing that feature set. An alternative is to use a much richer function approximation class that can go directly from states without requiring an explicit specification of features. Local representations, including kernel-based approaches, have some appealing properties (including convergence results in certain cases), but typically cannot scale to enormous spaces and datasets.

  16. Neural Networks. (Figure by Kjell Magne Fauske.)

  17. Deep Neural Networks (DNN). A DNN is a composition of multiple functions. The chain rule can be used to backpropagate the gradient. A major innovation: tools that automatically compute gradients for a DNN.

  18. Deep Neural Networks (DNN) Specification and Fitting. A DNN generally combines both linear transformations and non-linear transformations (activation functions). To fit the parameters, a loss function is required (MSE, log likelihood, etc.).
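
As an illustration, here is a minimal sketch of such a specification and fitting step in PyTorch (the choice of PyTorch, the layer sizes, and the dummy data are assumptions for illustration, not prescribed by the slides): two linear layers composed with a ReLU non-linearity, fit by one SGD step on an MSE loss.

```python
import torch
import torch.nn as nn

# Composition of linear and non-linear transformations
model = nn.Sequential(
    nn.Linear(4, 32),   # linear: z = Wx + b
    nn.ReLU(),          # non-linear activation
    nn.Linear(32, 1),   # linear output layer
)
loss_fn = nn.MSELoss()                                    # loss function used to fit the parameters
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# One stochastic gradient descent step on dummy data
x = torch.randn(8, 4)          # batch of 8 inputs with 4 features
y = torch.randn(8, 1)          # regression targets
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()                # backpropagate gradients via the chain rule
optimizer.step()
```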

  19. The Benefit of Deep Neural Network Approximators. Linear value function approximators assume the value function is a weighted combination of a set of features, where each feature is a function of the state. Linear VFAs often work well given the right set of features, but this can require carefully hand-designing that feature set. An alternative is to use a much richer function approximation class that can go directly from states without requiring an explicit specification of features. Local representations, including kernel-based approaches, have some appealing properties (including convergence results in certain cases), but typically cannot scale to enormous spaces and datasets. Alternative: deep neural networks. They use distributed representations instead of local representations, are universal function approximators, can need exponentially fewer nodes/parameters (compared to a shallow net) to represent the same function, and can learn their parameters using stochastic gradient descent.

  20. Table of Contents: 1. Control using Value Function Approximation; 2. Convolutional Neural Nets (CNNs); 3. Deep Q Learning

  21. Why Do We Care About CNNs? CNNs are used extensively in computer vision. If we want to go from pixels to decisions, it is likely useful to leverage those insights for handling visual input.

  22. Fully Connected Neural Net
