

1. CS325 Artificial Intelligence, Ch. 21 – Reinforcement Learning. Cengiz Günay, Emory Univ., Spring 2013.

2. Rats! A rat is put in a cage with a lever. Each lever press sends a signal to the rat's brain, to the reward center. The rat presses the lever continuously until . . . it dies, because it stops eating and drinking. (Image: Fundooprofessor)

3. (Image: Wikipedia.org)

4. Dopamine Neurons Respond to Novelty (Schultz et al., 1997; image: sciencemuseum.org.uk). It turns out that novelty detection = the Temporal Difference rule in Reinforcement Learning (Sutton and Barto, 1981).

5. Diagram: the general learning-agent architecture. Sensors and actuators connect the agent to the environment; a critic compares percepts against a performance standard and sends feedback to the learning element, which makes changes to the performance element's knowledge, sets learning goals, and drives a problem generator that proposes exploratory actions.

6. Entry/Exit Surveys. Exit survey, Planning Under Uncertainty: Why can't we use a regular MDP for partially-observable situations? Give an example where you think MDPs would help you solve a problem in your daily life. Entry survey, Reinforcement Learning (0.25 points of final grade): In a partially-observable scenario, can reinforcement be used to learn MDP rewards? How can we improve MDPs by using the plan-execute cycle?

7. Blindfolded MDPs: Enter Reinforcement Learning. [Grid-world figure: columns 1–4, rows a–c, with a goal G, a penalty square, and the start S.] What if the agent does not know anything about where the walls are or where the goals/penalties are? Can we still use the plan-execute cycle? Explore first, then update the world state based on the reward/reinforcement received ⇒ Reinforcement Learning (see the Scholarpedia article).
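One way to make the "explore first" idea concrete (this sketch and its env interface — reset(), actions(s), and step(s, a) returning (s′, reward) — are illustrative assumptions, not from the slides): wander with random actions, tally what you observe, and estimate the R and P that an ordinary MDP solver would need.

```python
import random
from collections import defaultdict

def explore(env, n_steps=10000):
    """Random exploration: record (s, a, s') transition counts and observed rewards."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    rewards = {}                                     # s -> observed reward
    s = env.reset()
    for _ in range(n_steps):
        a = random.choice(env.actions(s))            # blindfolded: act at random
        s_next, r = env.step(s, a)                   # assumed env interface
        counts[(s, a)][s_next] += 1
        rewards[s_next] = r
        s = s_next
    return counts, rewards

def estimate_model(counts):
    """Turn transition counts into estimated probabilities P(s'|s,a)."""
    P = {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        P[(s, a)] = {s2: n / total for s2, n in nexts.items()}
    return P
```

With estimated R and P in hand, the agent can plan as if it had an ordinary MDP — which is exactly the comparison the next slides set up.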

8. Where Does Reinforcement Learning Fit? Machine learning so far: Unsupervised learning finds regularities in the input data, x. Supervised learning finds a mapping between input and output, f(x) → y. Reinforcement learning finds a mapping between states and actions, s → a (by finding the optimal policy, π(s) → a). Which is it — S, U, or R? Speech recognition (connect sounds to transcripts): S. Star data (find groupings from spectral emissions): U. Rat presses lever (gets reward based on certain conditions): R. Elevator controller (multiple elevators, minimize wait time): R.

9. But Wasn't That What Markov Decision Processes Were? Find the optimal policy that maximizes expected reward:
π(s) = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s, π(s), s′) ],
with reward either at a state, R(s), or from an action, R(s, a, s′). This is done by estimating utility values with the value-iteration update:
V(s) ← R(s) + γ max_a Σ_{s′} P(s′ | s, a) V(s′),
with transition probabilities P(s′ | s, a). This assumes we know R(s) and P(s′ | s, a).
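For reference, a minimal sketch of the value-iteration update on this slide, assuming R and P are handed to us as plain dictionaries (that data layout, the actions(s) helper, and the fixed iteration count are assumptions for illustration):

```python
def value_iteration(states, actions, P, R, gamma=0.9, n_iter=100):
    """Update from the slide: V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) * V(s').

    P[(s, a)] is a dict {s': probability}; R[s] is the state reward.
    Assumes every state has at least one available action.
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_iter):
        V = {
            s: R[s] + gamma * max(
                sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions(s)
            )
            for s in states
        }
    return V
```

The slide's point stands: this only works because R[s] and P[(s, a)] are known in advance.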

10. Blindfolded Agent Must Learn From Rewards. We don't know R(s) or P(s′ | s, a) — what to do? Use Reinforcement Learning (RL) to explore and find the rewards. Agent types (what each knows, learns, and uses):
Utility agent — knows P; learns R → U; uses U.
Q-learning (RL) agent — learns Q(s, a); uses Q.
Reflex agent — learns π(s); uses π(s).
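The Q-learning row of the table refers to learning action values Q(s, a) directly from observed rewards, without ever knowing P or R. A minimal sketch of the standard tabular update (the function name and the actions(s) helper are illustrative):

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions(s_next))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# Q can be a defaultdict(float) keyed by (state, action) pairs.
Q = defaultdict(float)
```

Once learned, the greedy policy π(s) = argmax_a Q(s, a) recovers the state-to-action mapping the earlier slide asked for.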

11. Video: Backgammon and Choppers

12. How Much to Learn? (1) Passive RL, the simple case: keep the policy π(s) fixed and learn everything else — always take the same actions and learn the utilities. Examples: a public-transit commute; learning a difficult game. (2) Active RL: learn the policy at the same time — changing the policy helps the agent explore better. Example: driving your own car.
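Active RL has to balance trying new actions against exploiting what it already knows. ε-greedy selection is one common compromise (shown here as an illustrative sketch; the slide itself does not prescribe a particular exploration scheme):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore at random; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore: deviate from current policy
    return max(actions, key=lambda a: Q[(s, a)])       # exploit: current best action
```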

13. RL in Practice: The Temporal Difference (TD) Rule. Animals appear to use a derivative-like signal. Remember value iteration:
V(s) ← R(s) + γ max_a Σ_{s′} P(s′ | s, a) V(s′).
The TD rule instead uses the difference observed when going s → s′:
V(s) ← V(s) + α [ R(s) + γ V(s′) − V(s) ],
where α is the learning rate and γ is the discount factor. It's even simpler than before!
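A sketch of the TD rule on this slide applied as passive RL: follow the fixed policy, and after every transition s → s′ with observed reward, move V(s) a little toward R(s) + γ V(s′). The env and policy interfaces are assumptions for illustration, as in the earlier sketch:

```python
from collections import defaultdict

def td_value_learning(env, policy, n_steps=10000, alpha=0.1, gamma=0.9):
    """Passive TD(0): V(s) <- V(s) + alpha * (R(s) + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)                    # fixed policy: this is passive RL
        s_next, r = env.step(s, a)       # assumed env interface
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V
```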
