  1. Reinforcement Learning George Konidaris gdk@cs.brown.edu Fall 2019

  2. Machine Learning Subfield of AI concerned with learning from data. Broadly, using: • Experience • To Improve Performance • On Some Task (Tom Mitchell, 1997)

  3. ML vs. Statistics vs. Data Mining

  4. Why? Developing effective learning methods has proved difficult. Why bother? Autonomous discovery • We don’t know something, want to find out. Hard to program • Easier to specify task, collect data. Adaptive behavior • Our agents should adapt to new data, unforeseen circumstances.

  5. Types of Machine Learning Depends on the feedback available: Labeled data: • Supervised learning. No feedback, just data: • Unsupervised learning. Sequential data, weak labels: • Reinforcement learning.

  6. Supervised Learning Input: inputs X = {x_1, …, x_n} (training data), labels Y = {y_1, …, y_n}. Learn to predict new labels. Given x: y?

  7. Unsupervised Learning Input: inputs X = {x_1, …, x_n}. Try to understand the structure of the data. E.g., how many types of cars? How can they vary?

  8. Reinforcement Learning Learning counterpart of planning: $\max_{\pi: S \to A} R = \sum_{t=0}^{\infty} \gamma^t r_t$

  9. MDPs Agent interacts with an environment. At each time t: • receives sensor signal s_t • executes action a_t • Transition: new sensor signal s_{t+1} and reward r_t. Goal: find a policy π that maximizes the expected return (sum of discounted future rewards): $\max_\pi E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
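A minimal sketch of this interaction loop, assuming a hypothetical Gym-style environment with reset() and step() methods and a policy given as a function of state (these names are assumptions, not from the slides):

```python
# Hypothetical agent-environment loop accumulating the discounted return.
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    s = env.reset()                      # initial sensor signal s_0
    ret, discount = 0.0, 1.0
    for t in range(max_steps):
        a = policy(s)                    # execute action a_t
        s_next, r, done = env.step(a)    # transition: new signal s_{t+1}, reward r_t
        ret += discount * r              # sum of discounted future rewards
        discount *= gamma
        s = s_next
        if done:
            break
    return ret
```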

  10. Markov Decision Processes ⟨S, A, γ, R, T⟩: S: set of states. A: set of actions. γ: discount factor. R: reward function; R(s, a, s′) is the reward received for taking action a in state s and transitioning to state s′. T: transition function; T(s′ | s, a) is the probability of transitioning to state s′ after taking action a in state s.
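One simple way to represent this tuple in code; the two-state example below is made up for illustration and is not from the slides:

```python
# A tiny, made-up MDP written directly as the tuple <S, A, gamma, R, T>.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

# T[s][a] maps each next state s' to its probability T(s' | s, a).
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# R[s][a][s'] is the reward R(s, a, s') for that transition.
R = {
    "s0": {"stay": {"s0": 0.0}, "go": {"s1": 1.0, "s0": 0.0}},
    "s1": {"stay": {"s1": 0.0}, "go": {"s0": 5.0}},
}
```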

  11. RL vs Planning In planning: • Transition function (T) known. • Reward function (R) known. • Computation is “offline”. In reinforcement learning: • One or both of T, R unknown. • Acting in the world is the only source of data. • Transitions are executed, not simulated.

  12. Reinforcement Learning

  13. RL This formulation is general enough to encompass a wide variety of learned control problems.

  14. MDPs As before, our target is a policy π: S → A. A policy maps states to actions. The optimal policy maximizes $R(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ for all s. This means that we wish to find a policy that maximizes the return from every state.

  15. Planning via Policy Iteration In planning, we used policy iteration to find an optimal policy: 1. Start with a policy π. 2. Estimate V^π. 3. Improve π. Repeat. More precisely, we use a value function $V^\pi(s) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i\right]$, and then we would update π by computing $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s')\left[r(s, a, s') + \gamma V^\pi(s')\right]$. We can’t do this anymore when T and R are unknown.
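A compact sketch of that model-based loop for the planning setting, where T and R are known; the array shapes and the fixed number of evaluation sweeps are assumptions for illustration:

```python
import numpy as np

# Model-based policy iteration: T and R are arrays of shape (|S|, |A|, |S|).
def policy_iteration(T, R, gamma=0.9, eval_sweeps=100):
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)            # 1. start with an arbitrary policy
    while True:
        V = np.zeros(n_states)                    # 2. estimate V^pi by iterative evaluation
        for _ in range(eval_sweeps):
            for s in range(n_states):
                a = pi[s]
                V[s] = np.sum(T[s, a] * (R[s, a] + gamma * V))
        # 3. improve pi: pick the action with the best one-step lookahead value
        new_pi = np.array([
            np.argmax([np.sum(T[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)])
            for s in range(n_states)
        ])
        if np.array_equal(new_pi, pi):            # repeat until the policy stops changing
            return pi, V
        pi = new_pi
```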

  16. Value Functions For learning, we use a state-action value function: $Q^\pi(s, a) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a\right]$. This is the value of executing a in state s, then following π. Note that $V^\pi(s) = Q^\pi(s, \pi(s))$.
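For small discrete problems, Q can be stored as a table with one entry per state-action pair; a minimal sketch (the sizes below are made up):

```python
import numpy as np

n_states, n_actions = 10, 4           # assumed sizes, for illustration only
Q = np.zeros((n_states, n_actions))   # Q[s, a] approximates Q^pi(s, a)

pi = np.argmax(Q, axis=1)             # a greedy policy with respect to Q
V = Q[np.arange(n_states), pi]        # V^pi(s) = Q^pi(s, pi(s))
```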

  17. Policy Iteration This leads to a general policy improvement framework: 1. Start with a policy π. 2. Learn Q^π. 3. Improve π: a. $\pi(s) = \arg\max_a Q(s, a), \forall s$. Repeat. Steps 2 and 3 can be interleaved as rapidly as you like. Usually, perform 3a every time step.

  18. Value Function Learning Learning proceeds by gathering samples of Q(s, a). Methods differ by: • How you get the samples. • How you use them to update Q.

  19. Monte Carlo Simplest thing you can do: sample R(s) by rolling out an episode and collecting its rewards r, r, r, …. Do this repeatedly and average the values: $Q(s, a) = \frac{R_1(s) + R_2(s) + \dots + R_n(s)}{n}$
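A sketch of that averaging, assuming a hypothetical run_episode helper that rolls out one full episode from a given state and first action and returns its reward sequence:

```python
# Monte Carlo estimate of Q(s, a): average the returns of full sampled episodes.
def mc_estimate(run_episode, start_state, first_action, n_episodes=1000, gamma=0.99):
    total = 0.0
    for _ in range(n_episodes):
        rewards = run_episode(start_state, first_action)           # one complete rollout
        total += sum(gamma**i * r for i, r in enumerate(rewards))  # sampled return R_k(s)
    return total / n_episodes                                      # (R_1 + ... + R_n) / n
```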

  20. Temporal Difference Learning Where can we get more (immediate) samples? Idea: use the Bellman equation: $Q^\pi(s, a) = E_{s'}\left[r(s, a, s') + \gamma Q^\pi(s', \pi(s'))\right]$, i.e., the value of this state is the immediate reward plus the discounted value of the next state.

  21. TD Learning Ideally, and in expectation: $r_i + \gamma Q(s_{i+1}, a_{i+1}) - Q(s_i, a_i) = 0$. Q is correct if this holds in expectation for all states. When it does not: temporal difference error. Update: $Q(s_t, a_t) \leftarrow r_t + \gamma Q(s_{t+1}, a_{t+1})$
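A single TD update on one transition, as a sketch; the dict-of-(state, action) representation of Q and the learning rate α are assumptions:

```python
# One temporal-difference update for the transition (s, a, r, s', a').
def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q.get((s_next, a_next), 0.0)  # bootstrapped sample of Q(s, a)
    td_error = target - Q.get((s, a), 0.0)             # zero in expectation if Q is correct
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error  # move Q(s, a) toward the target
    return td_error
```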

  22. Sarsa Sarsa: a very simple algorithm. 1. Initialize Q[s][a] = 0. 2. For n episodes: • observe state s • select a = argmax_a Q[s][a] • observe the transition (s, a, r, s′, a′) • compute the TD error δ = r + γQ(s′, a′) − Q(s, a), where Q(s′, a′) is zero by definition if s′ is absorbing • update Q: Q(s, a) ← Q(s, a) + αδ • if not at the end of the episode, repeat.
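A runnable sketch of these steps; the environment interface (reset() and step() returning (next state, reward, done)) and the ε-greedy action selection are assumptions layered on top of the slide's pseudocode:

```python
import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                            # 1. initialize Q[s][a] = 0

    def select(s):
        if random.random() < epsilon:                 # explore occasionally
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])  # exploit: argmax_a Q[s][a]

    for _ in range(n_episodes):                       # 2. for n episodes
        s = env.reset()                               # observe state s
        a = select(s)                                 # select action a
        done = False
        while not done:
            s_next, r, done = env.step(a)             # observe transition (s, a, r, s')
            a_next = None if done else select(s_next)
            # Q(s', a') is zero by definition if s' is absorbing
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)]) # update Q with TD error delta
            s, a = s_next, a_next
    return Q
```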

  23. Sarsa

  24. Sarsa

  25. Exploration vs. Exploitation Always max_a Q(s, a)? • Exploit current knowledge. What if your current knowledge is wrong? How are you going to find out? • Explore to gain new knowledge. Exploration is mandatory if you want to find the optimal solution, but every exploratory action may sacrifice reward. Exploration vs. exploitation: when to try new things? A consistent theme of RL.

  26. Exploration vs. Exploitation How to balance? Simplest, most popular approach: instead of always being greedy (max_a Q(s, a)), explore with probability ε: • max_a Q(s, a) with probability (1 − ε) • a random action with probability ε. This is ε-greedy exploration (ε ≈ 0.1). • Very simple • Ensures asymptotic coverage of the state space.
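As a sketch (the dict representation of Q is an assumption):

```python
import random

# Epsilon-greedy action selection: greedy with probability 1 - epsilon,
# a uniformly random action with probability epsilon.
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit: max_a Q(s, a)
```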

  27. TD vs. MC TD and MC are two extremes of obtaining samples of Q: the one-step bootstrapped target $r + \gamma V$ (TD) versus the full sampled return $\sum_i \gamma^i r_i$ (MC). [Figure: sample targets cut off after t = 1, 2, 3, 4, …, L steps.]
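A small sketch comparing the two kinds of sample targets for Q(s, a) built from one recorded episode; the list-of-rewards and value-table representations are assumptions:

```python
# Two ways to build a sample target for Q(s_0, a_0) from one episode.
def td_target(r0, V, s1, gamma=0.99):
    """TD: one real reward, then bootstrap from the value estimate of the next state."""
    return r0 + gamma * V[s1]

def mc_target(rewards, gamma=0.99):
    """MC: the full discounted sum of the episode's rewards, no bootstrapping."""
    return sum(gamma**i * r for i, r in enumerate(rewards))
```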
