
Lecture 21: Reinforcement Learning (Justin Johnson, December 4, 2019)

Assignment 5: Object Detection (single-stage detector and two-stage detector) is due on Monday 12/9 at 11:59pm.


  1. Value Function and Q Function
  Following a policy $\pi$ produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$
  How good is a state? The value function at state s is the expected cumulative reward from following the policy starting from state s:
  $V^\pi(s) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi \right]$
  How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
  $Q^\pi(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right]$
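  To make the return inside those expectations concrete, here is a minimal Python sketch (not from the slides; the reward values and discount factor are made-up examples) that computes the discounted cumulative reward $\sum_{t \ge 0} \gamma^t r_t$ for one sampled trajectory.

```python
# Minimal sketch: discounted return of one sampled trajectory.
# The rewards and gamma below are made-up example values.

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for rewards r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # r_0 ... r_4 from one rollout
print(discounted_return(rewards))      # estimated return from s_0
```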

  2-5. Bellman Equation
  Optimal Q-function: $Q^*(s, a)$ is the Q-function for the optimal policy $\pi^*$. It gives the max possible future reward when taking action a in state s:
  $Q^*(s, a) = \max_\pi \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right]$
  Q* encodes the optimal policy: $\pi^*(s) = \arg\max_{a'} Q^*(s, a')$
  Bellman Equation: Q* satisfies the following recurrence relation:
  $Q^*(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$
  Intuition: after taking action a in state s, we get reward r and move to a new state s'. After that, the max possible reward we can get is $\max_{a'} Q^*(s', a')$.

  6-10. Solving for the optimal policy: Value Iteration
  Bellman Equation: Q* satisfies the following recurrence relation:
  $Q^*(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$
  Idea: if we find a function Q(s, a) that satisfies the Bellman Equation, then it must be Q*.
  Start with a random Q, and use the Bellman Equation as an update rule:
  $Q_{i+1}(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q_i(s', a') \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$
  Amazing fact: $Q_i$ converges to $Q^*$ as $i \to \infty$.
  Problem: we need to keep track of Q(s, a) for every (state, action) pair, which is impossible if the state space is infinite.
  Solution: approximate Q(s, a) with a neural network, and use the Bellman Equation as the loss (see the sketch below).
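  To illustrate the update rule, here is a minimal tabular Q-value-iteration sketch on a tiny made-up MDP; the states, actions, transitions, rewards, and discount are invented for the example and are not from the lecture.

```python
import numpy as np

# Tiny made-up MDP: 3 states, 2 actions, deterministic transitions.
# P[s][a] = next state, R[s][a] = reward; state 2 is terminal (self-loop, zero reward).
P = {0: {0: 1, 1: 0}, 1: {0: 2, 1: 0}, 2: {0: 2, 1: 2}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}, 2: {0: 0.0, 1: 0.0}}
gamma = 0.9

Q = np.zeros((3, 2))                      # start from an arbitrary Q
for _ in range(100):                      # Q_i -> Q* as i -> infinity
    Q_new = np.zeros_like(Q)
    for s in range(3):
        for a in range(2):
            s_next = P[s][a]
            # Bellman update: Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a')
            # (transitions here are deterministic, so the expectation drops out)
            Q_new[s, a] = R[s][a] + gamma * Q[s_next].max()
    Q = Q_new

print(Q)                                  # approximate Q*
print(Q.argmax(axis=1))                   # greedy policy pi*(s) = argmax_a Q*(s,a)
```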

  11-15. Solving for the optimal policy: Deep Q-Learning
  Bellman Equation: Q* satisfies the following recurrence relation:
  $Q^*(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$
  Train a neural network (with weights $\theta$) to approximate Q*: $Q^*(s, a) \approx Q(s, a; \theta)$
  Use the Bellman Equation to tell us what Q should output for a given state and action:
  $y_{s, a, \theta} = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta) \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$
  Use this to define the loss for training Q: $L(s, a) = \left( Q(s, a; \theta) - y_{s, a, \theta} \right)^2$
  Problem: nonstationary! The "target" for Q(s, a) depends on the current weights $\theta$.
  Problem: how do we sample batches of data for training?
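  Here is a minimal PyTorch sketch of that loss, assuming a Q-network q_net that outputs one Q-value per action and a sampled batch of (s, a, r, s', done) transitions; all names are placeholders rather than the lecture's code. Detaching the target is one common way to soften the nonstationarity issue noted above; the full DQN recipe goes further and uses experience replay plus (in later versions) a separate target network.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """One Bellman-error regression step.
    s, s_next: float tensors of states; a: long tensor of action indices;
    r: float tensor of rewards; done: float tensor (1 if the episode ended)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                        # treat the target as fixed
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                              # (Q(s,a;theta) - y)^2
```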

  16. Case Study: Playing Atari Games
  Objective: complete the game with the highest score. State: raw pixel inputs of the game screen. Action: game controls, e.g. Left, Right, Up, Down. Reward: score increase/decrease at each time step.
  Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013

  17. Case Study: Playing Atari Games
  Network input: state $s_t$, a 4x84x84 stack of the last 4 frames (after RGB-to-grayscale conversion, downsampling, and cropping).
  Network: a Q-network $Q(s, a; \theta)$ with weights $\theta$: Conv(4->16, 8x8, stride 4) -> Conv(16->32, 4x4, stride 2) -> FC-256 -> FC-A (Q-values).
  Network output: Q-values for all actions. With 4 actions, the last layer gives the values $Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)$.
  Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013
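  A sketch of that architecture in PyTorch, following the layer sizes listed on the slide; the ReLU activations and the absence of padding are assumptions about details the slide does not spell out.

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Q(s, .; theta): maps a 4x84x84 frame stack to one Q-value per action."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: Q-values for all actions
        )

    def forward(self, s):
        return self.net(s)

q_net = AtariQNetwork(num_actions=4)
fake_state = torch.zeros(1, 4, 84, 84)                   # placeholder frame stack
print(q_net(fake_state).shape)                           # torch.Size([1, 4])
```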

  18. https://www.youtube.com/watch?v=V1eYniJ0Rnk

  19-21. Q-Learning vs Policy Gradients
  Q-Learning: train a network $Q_\theta(s, a)$ to estimate future rewards for every (state, action) pair.
  Problem: for some problems this can be a hard function to learn; for some problems it is easier to learn a mapping from states to actions.
  Policy Gradients: train a network $\pi_\theta(a \mid s)$ that takes a state as input and gives a distribution over which action to take in that state.
  Objective function: the expected future reward when following policy $\pi_\theta$:
  $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ \sum_{t \ge 0} \gamma^t r_t \right]$
  Find the optimal policy by maximizing: $\theta^* = \arg\max_\theta J(\theta)$ (use gradient ascent!)

  22-23. Policy Gradients
  Objective function: the expected future reward when following policy $\pi_\theta$:
  $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ \sum_{t \ge 0} \gamma^t r_t \right]$
  Find the optimal policy by maximizing: $\theta^* = \arg\max_\theta J(\theta)$ (use gradient ascent!)
  Problem: nondifferentiability! We don't know how to compute $\partial J / \partial \theta$.
  General formulation: $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]$; we want to compute $\partial J / \partial \theta$.

  24-31. Policy Gradients: REINFORCE Algorithm
  General formulation: $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]$; we want to compute $\partial J / \partial \theta$.
  $\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta} \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right] = \frac{\partial}{\partial \theta} \int_X p_\theta(x) f(x) \, dx = \int_X f(x) \frac{\partial}{\partial \theta} p_\theta(x) \, dx$
  Rewrite the gradient of the density using the log-derivative trick:
  $\frac{\partial}{\partial \theta} p_\theta(x) = p_\theta(x) \frac{1}{p_\theta(x)} \frac{\partial}{\partial \theta} p_\theta(x) = p_\theta(x) \frac{\partial}{\partial \theta} \log p_\theta(x)$
  Substituting back in:
  $\frac{\partial J}{\partial \theta} = \int_X f(x) \, p_\theta(x) \frac{\partial}{\partial \theta} \log p_\theta(x) \, dx = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \frac{\partial}{\partial \theta} \log p_\theta(x) \right]$
  Approximate the expectation via sampling!
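  A minimal sketch of "approximate the expectation via sampling" for a toy case, where x is drawn from a categorical distribution with learnable logits $\theta$ and f is an arbitrary fixed table of rewards; both are made up for illustration. Averaging $f(x)\,\partial_\theta \log p_\theta(x)$ over samples gives an unbiased estimate of $\partial J / \partial \theta$.

```python
import torch

theta = torch.zeros(3, requires_grad=True)        # parameters of p_theta (a categorical)
f = torch.tensor([1.0, 2.0, 0.5])                 # made-up "reward" f(x) for each outcome

dist = torch.distributions.Categorical(logits=theta)
x = dist.sample((1000,))                          # draw samples x ~ p_theta

# Monte Carlo estimate of dJ/dtheta = E_x[ f(x) * d/dtheta log p_theta(x) ]
surrogate = (f[x] * dist.log_prob(x)).mean()      # differentiating this gives the estimator
surrogate.backward()
print(theta.grad)                                 # estimated gradient of E[f(x)] w.r.t. theta
```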

  32-38. Policy Gradients: REINFORCE Algorithm
  Goal: train a network $\pi_\theta(a \mid s)$ that takes a state as input and gives a distribution over which action to take in that state.
  Define: let $x = (s_0, a_0, s_1, a_1, \ldots)$ be the sequence of states and actions we get when following policy $\pi_\theta$. It's random: $x \sim p_\theta(x)$, with
  $p_\theta(x) = \prod_{t \ge 0} P(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t) \;\Rightarrow\; \log p_\theta(x) = \sum_{t \ge 0} \left[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \right]$
  The first term contains the transition probabilities of the environment: we can't compute them. The second term contains the action probabilities of the policy: we are learning them! Since the transition probabilities do not depend on $\theta$, they drop out of the gradient:
  $\frac{\partial}{\partial \theta} \log p_\theta(x) = \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)$

  39-44. Policy Gradients: REINFORCE Algorithm
  Expected reward under $\pi_\theta$: $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right]$. Combining this with the previous result gives
  $\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \frac{\partial}{\partial \theta} \log p_\theta(x) \right] = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t) \right]$
  Reading the terms: the expectation is over x, the sequence of states and actions obtained when following policy $\pi_\theta$; $f(x)$ is the reward we get from that state sequence; and $\frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)$ is the gradient of the predicted action scores with respect to the model weights, computed by backpropagating through the model $\pi_\theta$.

  45-48. Policy Gradients: REINFORCE Algorithm
  $\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t) \right]$
  The training loop (see the sketch below):
  1. Initialize random weights $\theta$
  2. Collect trajectories x and rewards f(x) using policy $\pi_\theta$
  3. Compute dJ/d$\theta$
  4. Gradient ascent step on $\theta$
  5. GOTO 2
  Intuition: when f(x) is high, increase the probability of the actions we took; when f(x) is low, decrease the probability of the actions we took.
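  A minimal PyTorch sketch of steps 1-5 above. The environment, network sizes, and hyperparameters are placeholders invented for the example (a tiny bandit-style toy environment is defined inline so the sketch runs end to end); this is not the lecture's implementation.

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Made-up 1-state toy environment: action 1 gives reward +1, action 0 gives 0;
    episodes last 10 steps. Exists only so this sketch is self-contained."""
    def reset(self):
        self.t = 0
        return torch.zeros(4)                       # a dummy 4-dim state
    def step(self, action):
        self.t += 1
        return torch.zeros(4), float(action == 1), self.t >= 10

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # pi_theta(a|s) logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
env, gamma = ToyEnv(), 0.99

for iteration in range(200):                        # steps 2-5: collect, compute dJ/dtheta, ascend
    s, log_probs, rewards, done = env.reset(), [], [], False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()                           # sample a_t ~ pi_theta(.|s_t)
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)
    f_x = sum((gamma ** t) * r for t, r in enumerate(rewards))   # f(x): trajectory reward
    loss = -f_x * torch.stack(log_probs).sum()      # minimizing this = gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("action probabilities after training:",
      torch.distributions.Categorical(logits=policy(torch.zeros(4))).probs)
```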

  49-50. So far: Q-Learning and Policy Gradients
  Q-Learning: train a network $Q_\theta(s, a)$ to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q:
  $y_{s, a, \theta} = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta) \right]$, where $r \sim R(s, a)$, $s' \sim P(s, a)$, and $L(s, a) = \left( Q(s, a; \theta) - y_{s, a, \theta} \right)^2$
  Policy Gradients: train a network $\pi_\theta(a \mid s)$ that takes a state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients:
  $J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \right], \qquad \frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_\theta}\left[ f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t) \right]$
  Improving policy gradients: add a baseline to reduce the variance of the gradient estimator (see the sketch below).
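  A small sketch of the baseline idea. The particular baseline used here (a running mean of past trajectory rewards) is an assumption about one common, simple choice, not something the slide specifies: subtracting a baseline b from f(x) before weighting the log-probabilities leaves the gradient unbiased but reduces its variance.

```python
import torch

class ReinforceWithBaseline:
    """REINFORCE loss with a running-mean baseline (one simple, common choice)."""
    def __init__(self, beta=0.9):
        self.baseline, self.beta = 0.0, beta

    def loss(self, log_probs, f_x):
        """log_probs: list of log pi_theta(a_t|s_t) tensors; f_x: scalar trajectory reward."""
        advantage = f_x - self.baseline                            # weight actions by f(x) - b
        self.baseline = self.beta * self.baseline + (1 - self.beta) * f_x
        return -advantage * torch.stack(log_probs).sum()           # minimize = ascend on J
```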

  51-55. Other approaches
  - Actor-Critic: train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning). [Sutton and Barto, "Reinforcement Learning: An Introduction", 1998; Degris et al, "Model-free reinforcement learning with continuous action in practice", 2012; Mnih et al, "Asynchronous Methods for Deep Reinforcement Learning", ICML 2016]
  - Model-Based: learn a model of the world's state transition function $P(s_{t+1} \mid s_t, a_t)$, then use planning through the model to make decisions.
  - Imitation Learning: gather data about how experts perform in the environment, and learn a function to imitate what they do (a supervised learning approach).
  - Inverse Reinforcement Learning: gather data of experts performing in the environment, learn a reward function that they seem to be optimizing, then use RL on that reward function. [Ng et al, "Algorithms for Inverse Reinforcement Learning", ICML 2000]
  - Adversarial Learning: learn to fool a discriminator that classifies actions as real/fake. [Ho and Ermon, "Generative Adversarial Imitation Learning", NeurIPS 2016]

  56-60. Case Study: Playing Games
  - AlphaGo (January 2016): used imitation learning + tree search + RL; beat 18-time world champion Lee Sedol.
  - AlphaGo Zero (October 2017): simplified version of AlphaGo; no longer using imitation learning; beat (at the time) #1 ranked Ke Jie.
  - Alpha Zero (December 2018): generalized to other games: Chess and Shogi.
  - MuZero (November 2019): plans through a learned model of the game.
  November 2019: Lee Sedol announces his retirement: "With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts" and "Even if I become the number one, there is an entity that cannot be defeated." (Quotes from https://en.yna.co.kr/view/AEN20191127004800315)
  Silver et al, "Mastering the game of Go with deep neural networks and tree search", Nature 2016; Silver et al, "Mastering the game of Go without human knowledge", Nature 2017; Silver et al, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play", Science 2018; Schrittwieser et al, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv 2019

  61. More Complex Games
  StarCraft II: AlphaStar (October 2019). Vinyals et al, "Grandmaster level in StarCraft II using multi-agent reinforcement learning", Nature 2019.
  Dota 2: OpenAI Five (April 2019). No paper, only a blog post: https://openai.com/five/#how-openai-five-works

  62. Reinforcement Learning: Interacting With World
  [Diagram: the Agent sends an Action to the Environment; the Environment returns a Reward to the Agent.]
  Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment.

  63-66. Reinforcement Learning: Stochastic Computation Graphs
  Can also use RL to train neural networks with nondifferentiable components!
  Example: a small "routing" network sends an image to one of K downstream CNNs. The router outputs a distribution over which network to use, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7, and we sample one of them (here: green).
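  A minimal sketch of how such a nondifferentiable routing choice could be trained with the REINFORCE rule from earlier in the lecture; treating the negative task loss of the chosen expert as the router's reward, the network sizes, and all names are assumptions for illustration, not the lecture's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 3                                                 # number of downstream "expert" networks
router = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))   # small routing network
experts = nn.ModuleList(
    [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(K)]
)  # stand-ins for the K CNNs; each outputs 10 class scores
optimizer = torch.optim.Adam(list(router.parameters()) + list(experts.parameters()), lr=1e-3)

image = torch.randn(1, 3, 32, 32)                     # made-up image and label
label = torch.tensor([7])

dist = torch.distributions.Categorical(logits=router(image))  # e.g. P(orange), P(blue), P(green)
k = dist.sample()                                     # nondifferentiable routing decision
task_loss = F.cross_entropy(experts[k.item()](image), label)  # loss of the chosen expert

reward = -task_loss.detach()                          # reward the router for a low task loss
router_loss = -(reward * dist.log_prob(k)).sum()      # REINFORCE term for the sampled choice

optimizer.zero_grad()
(task_loss + router_loss).backward()                  # the chosen expert gets ordinary gradients
optimizer.step()
```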
