SLIDE 1

Reinforcement Learning

George Konidaris gdk@cs.brown.edu

Fall 2019

SLIDE 2

Machine Learning

Subfield of AI concerned with learning from data. Broadly, using:

  • Experience
  • To Improve Performance
  • On Some Task

(Tom Mitchell, 1997)

SLIDE 3

vs …

ML vs. Statistics vs. Data Mining

SLIDE 4

Why?

Developing effective learning methods has proved difficult. Why bother?

Autonomous discovery

  • We don’t know something, want to find out.

Hard to program

  • Easier to specify the task and collect data.

Adaptive behavior

  • Our agents should adapt to new data, unforeseen circumstances.

SLIDE 5

Types of Machine Learning

Depends on the feedback available:

Labeled data:

  • Supervised learning.

No feedback, just data:

  • Unsupervised learning.

Sequential data, weak labels:

  • Reinforcement learning.
SLIDE 6

Supervised Learning

Input: training data consisting of inputs X = {x1, …, xn} and labels Y = {y1, …, yn}.

Learn to predict new labels. Given x: y?

SLIDE 7

Unsupervised Learning

Input: X = {x1, …, xn} (inputs only, no labels).

Try to understand the structure of the data. E.g., how many types of cars are there? How can they vary?

SLIDE 8

Reinforcement Learning

Learning counterpart of planning: find a policy π : S → A that maximizes the discounted return,

max_π R = Σ_{t=0}^∞ γ^t r_t

SLIDE 9

MDPs

Agent interacts with an environment. At each time t:

  • Receives sensor signal s_t
  • Executes action a_t
  • Transition:
  • new sensor signal s_{t+1}
  • reward r_t

Goal: find policy π that maximizes expected return (sum of discounted future rewards):

max_π E[R] = E[Σ_{t=0}^∞ γ^t r_t]

SLIDE 10

Markov Decision Processes

An MDP is a tuple ⟨S, A, γ, R, T⟩:

  • S: set of states.
  • A: set of actions.
  • γ: discount factor.
  • R: reward function; R(s, a, s′) is the reward received for taking action a in state s and transitioning to state s′.
  • T: transition function; T(s′ | s, a) is the probability of transitioning to state s′ after taking action a in state s.
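To make the tuple concrete, here is a minimal sketch of a two-state MDP written out in Python; the states, actions, and numbers are illustrative, not taken from the slides.

```python
# A tiny MDP <S, A, gamma, R, T>, written out explicitly.
# All states, actions, and values here are illustrative.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

# T[(s, a)][s_next] = T(s' | s, a): probability of landing in s_next.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# R[(s, a, s_next)] = R(s, a, s'): reward for that transition.
R = {
    ("s0", "go", "s1"): 1.0,  # reaching s1 is rewarded
}

def reward(s, a, s_next):
    """Look up R(s, a, s'); unlisted transitions earn zero."""
    return R.get((s, a, s_next), 0.0)
```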

SLIDE 11

RL vs Planning

In planning:

  • Transition function (T) known.
  • Reward function (R) known.
  • Computation “offline”.

In reinforcement learning:

  • One or both of T, R unknown.
  • Action in the world only source of data.
  • Transitions are executed, not simulated.
SLIDE 12

Reinforcement Learning

SLIDE 13

RL

This formulation is general enough to encompass a wide variety of learned control problems.

SLIDE 14

MDPs

As before, our target is a policy:

π : S → A

A policy maps states to actions. The optimal policy maximizes the return from every state:

max_π E[R(s)] = E[Σ_{t=0}^∞ γ^t r_t | s_0 = s], ∀s

This means that we wish to find a policy that maximizes the return from every state.
SLIDE 15

Planning via Policy Iteration

In planning, we used policy iteration to find an optimal policy:

  • 1. Start with a policy π
  • 2. Estimate V^π
  • 3. Improve π:
  •    a. π(s) = argmax_a E[r + γV^π(s′)], ∀s

Repeat.

More precisely, we use a value function:

V^π(s) = E[Σ_{i=0}^∞ γ^i r_i]

… then we would update π by computing:

π(s) = argmax_a Σ_{s′} T(s, a, s′) [r(s, a, s′) + γV^π(s′)]

We can’t do this anymore: the update requires T and R, which are unknown in RL.

SLIDE 16

Value Functions

For learning, we use a state-action value function:

Q^π(s, a) = E[Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a]

This is the value of executing a in state s, then following π. Note that V^π(s) = Q^π(s, π(s)).

SLIDE 17

Policy Iteration

This leads to a general policy improvement framework:

  • 1. Start with a policy π
  • 2. Learn Q^π
  • 3. Improve π:
  •    a. π(s) = argmax_a Q(s, a), ∀s

Repeat.

Steps 2 and 3 can be interleaved as rapidly as you like. Usually, step 3a is performed every time step.
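Step 3a is just an argmax over the learned values. A minimal sketch, assuming Q is stored as a nested dict mapping states to per-action values (an illustrative layout, not the slides’ notation):

```python
def improve_policy(Q):
    """Greedy improvement: pi(s) = argmax_a Q(s, a) for every state."""
    return {s: max(action_values, key=action_values.get)
            for s, action_values in Q.items()}

# Example: the improved policy picks "go" in s0 and "stay" in s1.
Q = {"s0": {"stay": 0.1, "go": 0.5}, "s1": {"stay": 0.3, "go": 0.2}}
pi = improve_policy(Q)   # {"s0": "go", "s1": "stay"}
```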

SLIDE 18

Value Function Learning

Learning proceeds by gathering samples of Q(s, a). Methods differ by:

  • How you get the samples.
  • How you use them to update Q.

SLIDE 19

Monte Carlo

Simplest thing you can do: sample the return R(s). Do this repeatedly and average the values:

Q(s, a) = (R_1(s) + R_2(s) + … + R_n(s)) / n
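In practice the average is usually maintained incrementally rather than by storing every sampled return. A minimal sketch, with illustrative names:

```python
from collections import defaultdict

Q = defaultdict(float)      # running average of sampled returns per (s, a)
counts = defaultdict(int)   # number of returns observed per (s, a)

def mc_update(s, a, sampled_return):
    """Fold one sampled return into the running average for (s, a).

    Q <- Q + (R - Q) / n is algebraically the mean of all n samples."""
    counts[(s, a)] += 1
    Q[(s, a)] += (sampled_return - Q[(s, a)]) / counts[(s, a)]
```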

SLIDE 20

Temporal Difference Learning

Where can we get more (immediate) samples? Idea: use the Bellman equation, which relates the value of this state to the reward plus the discounted value of the next state:

Q^π(s, a) = E_{s′}[r(s, a, s′) + γQ^π(s′, π(s′))]
SLIDE 21

TD Learning

Ideally, and in expectation:

r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t) = 0

Q is correct if this holds in expectation for all states. When it does not, the difference is the temporal difference (TD) error, and we move Q(s_t, a_t) toward the sampled target:

Q(s_t, a_t) ← r_t + γQ(s_{t+1}, a_{t+1})

SLIDE 22

Sarsa

Sarsa: a very simple algorithm.

  • 1. Initialize Q[s][a] = 0
  • 2. For n episodes
  • observe state s
  • select a = argmax_a Q[s][a]
  • observe transition (s, a, r, s′, a′)
  • compute TD error δ = r + γQ(s′, a′) − Q(s, a)
  • update Q: Q(s, a) = Q(s, a) + αδ
  • if not end of episode, repeat

The γQ(s′, a′) term is zero by definition if s′ is absorbing.
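Translated into Python, the loop is only a few lines. This is a sketch under assumptions the slides don’t specify: a tabular Q stored as a flat dict keyed by (s, a), and an environment with reset(), step(a) returning (s′, r, done), and actions(s).

```python
from collections import defaultdict

def sarsa(env, n_episodes, alpha=0.1, gamma=0.99):
    """Tabular Sarsa following the pseudocode above (env interface assumed)."""
    Q = defaultdict(float)                     # 1. Q[s][a] = 0 for all (s, a)

    def greedy(s):
        return max(env.actions(s), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):                # 2. for n episodes
        s = env.reset()                        # observe state s
        a = greedy(s)                          # select a = argmax_a Q[s][a]
        done = False
        while not done:
            s_next, r, done = env.step(a)      # observe transition
            a_next = None if done else greedy(s_next)
            bootstrap = 0.0 if done else gamma * Q[(s_next, a_next)]
            delta = r + bootstrap - Q[(s, a)]  # TD error (bootstrap is zero
            Q[(s, a)] += alpha * delta         #   by def. if s' is absorbing)
            s, a = s_next, a_next              # if not end of episode, repeat
    return Q
```

With the ε-greedy selection from Slide 26 substituted for the pure argmax, this becomes the exploring version of the algorithm.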

SLIDE 23

Sarsa

SLIDE 24

Sarsa

SLIDE 25

Exploration vs. Exploitation

Always take max_a Q(s, a)?

  • Exploit current knowledge.

What if your current knowledge is wrong? How are you going to find out?

  • Explore to gain new knowledge.

Exploration is mandatory if you want to find the optimal solution, but every exploratory action may sacrifice reward.

Exploration vs. exploitation (when to try new things?) is a consistent theme of RL.

SLIDE 26

Exploration vs. Exploitation

How to balance? Simplest, most popular approach: instead of always being greedy (max_a Q(s, a)), explore with probability ε:

  • max_a Q(s, a) with probability (1 − ε).
  • random action with probability ε.

ε-greedy exploration (ε ≈ 0.1):

  • Very simple.
  • Ensures asymptotic coverage of the state space.
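A sketch of the selection rule (the function name and Q layout are illustrative); dropping it into the Sarsa loop above in place of the pure argmax gives the exploring algorithm:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Random action with probability epsilon, else argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(list(actions))       # explore
    return max(actions, key=lambda a: Q[(s, a)])  # exploit
```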

SLIDE 27

TD vs. MC

TD and MC are two extremes of obtaining samples of Q:

  • TD backs up after a single step: the sample is r + γV, one observed reward plus the current value estimate of the next state.
  • MC waits until the episode ends at t = L: the sample is Σ_i γ^i r_i, the full discounted return with no bootstrapping.

(Figure: backup diagrams for each, over time steps t = 1, 2, 3, 4, …, L.)

SLIDE 28

Generalizing TD

We can generalize this to the idea of an n-step rollout:

R^(1)_t = r_t + γQ(s_{t+1}, a_{t+1})

R^(2)_t = r_t + γr_{t+1} + γ²Q(s_{t+2}, a_{t+2})

…

R^(n)_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q(s_{t+n}, a_{t+n})

Each tells us something about the value function.

  • We can combine all n-step rollouts.
  • This is known as a complex backup.
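A sketch of computing R^(n)_t from a recorded trajectory; rewards[i] holds r_{t+i} and q_bootstrap stands in for Q(s_{t+n}, a_{t+n}) (illustrative names):

```python
def n_step_return(rewards, q_bootstrap, gamma, n):
    """R^(n)_t = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * Q(s_{t+n}, a_{t+n})."""
    discounted_rewards = sum(gamma**i * rewards[i] for i in range(n))
    return discounted_rewards + gamma**n * q_bootstrap
```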

SLIDE 29

TD(λ)

Take a weighted sum of the n-step returns, with weights 1, λ, λ², …, λ^{n−1}, …:

R^(1)_t = r_t + γQ(s_{t+1}, a_{t+1})

R^(2)_t = r_t + γr_{t+1} + γ²Q(s_{t+2}, a_{t+2})

…

R^(n)_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q(s_{t+n}, a_{t+n})

Estimator:

R^λ_t = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R^(n)_t
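Given the n-step returns for one time step, this weighted combination is straightforward to compute. A sketch over a finite episode, where the last entry is the full Monte Carlo return and absorbs the leftover weight λ^{L−1} so that all weights sum to one:

```python
def lambda_return(n_step_returns, lam):
    """Weight R^(n) by (1 - lam) * lam**(n - 1); the final (Monte Carlo)
    return takes the remaining weight lam**(L - 1)."""
    L = len(n_step_returns)
    weighted = sum((1 - lam) * lam**(n - 1) * n_step_returns[n - 1]
                   for n in range(1, L))
    return weighted + lam**(L - 1) * n_step_returns[-1]
```

At lam = 0 only R^(1) survives; at lam = 1 only the final return does, matching the next slide.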
SLIDE 30

Sarsa(λ)

This is called the λ-return.

  • At λ = 0 we get Sarsa; at λ = 1 we get MC.
  • Intermediate values of λ are usually best.
  • This yields the TD(λ) family of algorithms.
SLIDE 31

Sarsa(λ): Implementation

Each (s, a) pair has an eligibility trace e(s, a).

At time t:

  • e(s_t, a_t) = 1
  • e(s, a) = γλe(s, a), for all other (s, a) pairs.

At end of episode: e(s, a) = 0, for all (s, a) pairs.

When updating:

  • Compute the TD error δ as before.
  • Q(s, a) = Q(s, a) + αδe(s, a), for each (s, a) pair.
SLIDE 32

Sarsa(λ): Implementation

  • 1. Initialize Q[s][a] = 0, for all (s, a)
  • 2. Initialize e[s][a] = 0, for all (s, a)
  • 3. For n episodes
  • observe state s_t
  • select a_t = argmax_a Q(s_t, a)
  • observe transition (s_t, a_t, r_t, s_{t+1}, a_{t+1})
  • compute TD error δ = r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
  • set e(s_t, a_t) = 1; decay all other e(s, a) = γλe(s, a) (don’t forget!)
  • update Q: Q(s, a) = Q(s, a) + αδe(s, a), for all (s, a)
  • if not end of episode, repeat
  • if end of episode, reset e[s][a] = 0 for all (s, a)

The γQ(s_{t+1}, a_{t+1}) term is zero by definition if s_{t+1} is absorbing.
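A sketch of one step of this update, matching the pseudocode above; the flat dict of traces and the argument list are assumptions, not the slides’ notation:

```python
def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """One Sarsa(lambda) step: compute delta, refresh traces, update all pairs."""
    bootstrap = 0.0 if done else gamma * Q[(s_next, a_next)]  # zero if absorbing
    delta = r + bootstrap - Q[(s, a)]      # TD error, as before
    for key in e:                          # decay every trace -- don't forget!
        e[key] *= gamma * lam
    e[(s, a)] = 1.0                        # the visited pair gets a full trace
    for key, trace in e.items():
        Q[key] += alpha * delta * trace    # update each (s, a) by its trace
    if done:
        for key in e:                      # reset traces between episodes
            e[key] = 0.0
```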
SLIDE 33

Sarsa(λ)

SLIDE 34

Sarsa(λ)

SLIDE 35

Next Week: More Realism