CSC 411: Lecture 19: Reinforcement Learning
Class based on Raquel Urtasun & Rich Zemel's lectures
Sanja Fidler
University of Toronto
April 3, 2016
Urtasun, Zemel, Fidler (UofT) CSC 411: 19-Reinforcement Learning April 3, 2016 1 / 39
◮ Supervised learning: correct outputs are provided
◮ Unsupervised learning: no feedback; the learner must construct its own measure of good output
◮ Reinforcement learning:
◮ continuous stream of input information and actions
◮ effects of an action depend on the state of the world
◮ the agent obtains a reward that depends on the world state and its actions
◮ no correct response is given, just some feedback
At each time step, the agent:
◮ takes an action a_t (possibly a null action)
◮ receives some reward r_{t+1}
◮ moves into a new state s_{t+1}
An RL agent may include these components:
◮ Policy π: the agent's behaviour function
◮ Value function: how good each state and/or action is
◮ Model: the agent's representation of the environment
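The interaction loop above can be sketched in a few lines of Python. This is an illustrative skeleton, not code from the lecture; `env` and `agent` and their methods are hypothetical stand-ins.

```python
# Hypothetical sketch of the agent-environment loop: at each step the agent
# takes an action a_t, receives a reward r_{t+1}, and moves to a new state
# s_{t+1}. The `env`/`agent` interfaces are illustrative assumptions.
def run_episode(env, agent, max_steps=100):
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.act(s)                # policy pi: state -> action
        s_next, r, done = env.step(a)   # world responds with reward, next state
        agent.observe(s, a, r, s_next)  # agent may update its value function/model
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```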
◮ reward: win/lose/tie the game (+1/−1/0) [given only at the final move of a game]
◮ state: positions of X's and O's on the board
◮ policy: mapping from states to actions, based on the rules of the game: choice of one open position
◮ value function: prediction of future reward, based on the current state
◮ start with all values = 0.5
◮ policy: choose the move leading to the state with the highest value
◮ update entries in the table based on the outcomes of the games played
◮ after many games, the value function will converge to a good estimate of the probability of winning from each state
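The table update above can be sketched as a simple temporal-difference rule, V(s) ← V(s) + α (V(s′) − V(s)), moving each state's value toward the value of the state that followed it. This is a minimal sketch; the step size α and the dictionary representation are illustrative assumptions.

```python
# Hypothetical sketch of the tic-tac-toe value-table update:
# V(s) <- V(s) + alpha * (V(s') - V(s)), with all values starting at 0.5.
def update_value(V, s, s_next, alpha=0.1):
    v_s = V.get(s, 0.5)       # unseen states default to 0.5
    v_next = V.get(s_next, 0.5)
    V[s] = v_s + alpha * (v_next - v_s)  # move V(s) toward V(s')
    return V[s]
```

After a won game, the terminal state's value is 1, so the states that led to it are gradually pulled upward over many games.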
◮ a purely greedy strategy would just select the optimal action in each state
◮ trade-off: immediate reward (exploitation) vs. gaining knowledge that might lead to higher reward in the future (exploration)
◮ Exploitation: Go to your favourite restaurant ◮ Exploration: Try a new restaurant
◮ Exploitation: Show the most successful advert ◮ Exploration: Show a different advert
◮ Exploitation: Drill at the best known location ◮ Exploration: Drill at a new location
◮ Exploitation: Play the move you believe is best ◮ Exploration: Play an experimental move
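A standard way to balance the two is ε-greedy action selection: exploit the best-known action most of the time, but explore a random action with small probability ε. A minimal sketch, assuming Q is a dictionary mapping (state, action) pairs to estimated values (the names are illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore (uniform random action); otherwise
    exploit (pick the action with the highest estimated value)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```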
Return: the total discounted future reward, R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · = Σ_{k=0}^{∞} γ^k r_{t+k+1}, with discount factor 0 ≤ γ < 1
Quiz show example:
◮ assume a series of questions, increasingly difficult, but with increasing payoff
◮ choice: accept the accumulated earnings and quit, or continue and risk losing everything
If the agent knows the reward function r(s, a) and the state-transition function δ(s, a), it can compute the optimal value function and act greedily with respect to it:

V*(s) = max_a [ r(s, a) + γ V*(δ(s, a)) ]
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

◮ This works well if we know δ() and r()
◮ But when we don't, we cannot choose actions this way
Define a new function, very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action without knowing δ() or r():

π*(s) = argmax_a Q(s, a)
Note that V*(s) = max_{a′} Q(s, a′), so Q can be written recursively:

Q(s_t, a_t) = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)
Observe the current state s, then repeatedly:
◮ Select an action a and execute it
◮ Receive immediate reward r
◮ Observe the new state s′
◮ Update the table entry: Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
◮ s ← s′
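The loop above can be sketched as tabular Q-learning on a tiny deterministic environment. The three-state chain, its `step` function, and the ε-greedy selection are illustrative assumptions, not from the slides; the update line is the deterministic rule Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′).

```python
import random

# Hypothetical environment: a chain 0 -> 1 -> 2, where reaching state 2
# pays reward 1 and ends the episode.
N_STATES = 3
ACTIONS = ['left', 'right']
GAMMA = 0.9

def step(s, a):
    """Deterministic transition: 'right' moves toward the goal state 2."""
    s_next = min(s + 1, N_STATES - 1) if a == 'right' else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

def q_learning(episodes=200, eps=0.1):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: Q[(s, x)])
            s_next, r, done = step(s, a)
            # deterministic Q-learning update from the slides:
            # Q(s, a) <- r + gamma * max_a' Q(s', a')
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, x)] for x in ACTIONS)
            s = s_next
    return Q
```

After training, the entries along the path to the goal settle at Q(1, right) = 1 and Q(0, right) = γ · 1 = 0.9, matching the recursive definition.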
Example update, with γ = 0.9 and immediate reward r = 0:

Q̂(s₁, a_right) ← r + γ max_{a′} Q̂(s₂, a′) = 0 + 0.9 max{63, 81, 100} = 90
◮ a common schedule: more exploration early on, shifting towards exploitation as the estimates improve
Convergence: in a deterministic world, Q̂ converges to the true Q as t → ∞, provided every state-action pair is visited infinitely often, using the update Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)