
CMP784 DEEP LEARNING Lecture #12: Deep Reinforcement Learning



  1. CMP784 DEEP LEARNING Lecture #12 – Deep Reinforcement Learning
     Aykut Erdem // Hacettepe University // Spring 2018
     [Title image: DeepLoco by X. B. Peng, G. Berseth & M. van de Panne]

  2. Previously on CMP784
     • Generative Adversarial Networks (GANs)
     • How do GANs work?
     • Conditional GAN
     • Tips and Tricks
     • Applications
     [Image: Neural Face by Taehoon Kim]

  3. Lecture overview
     • What is Reinforcement Learning?
     • Components of a RL problem
     • Markov Decision Processes
     • Value-Based Deep RL
     • Policy-Based Deep RL
     • Model-Based Deep RL
     Disclaimer: Much of the material and slides for this lecture were borrowed from
     — John Schulman’s talk on “Deep Reinforcement Learning: Policy Gradients and Q-Learning”
     — David Silver’s tutorial on “Deep Reinforcement Learning”
     — Lex Fridman’s MIT 6.S094 Deep Learning for Self-Driving Cars class

  4. What is Reinforcement Learning?

  5. What is Reinforcement Learning?
     [Figure comparing Supervised Learning, Unsupervised Learning, and Reinforcement Learning]
     Slide credit: Razvan Pascanu

  6. What is Reinforcement Learning?
     • Branch of machine learning concerned with taking sequences of actions
     • Usually described in terms of an agent interacting with a previously unknown environment, trying to maximize cumulative reward
     [Figure: agent-environment loop; the agent sends an action to the environment and receives an observation and a reward back]
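To make this loop concrete, here is a minimal sketch of one episode of agent-environment interaction. The DummyEnv, RandomAgent, and the reset()/step() interface are illustrative assumptions (loosely following the common Gym convention), not part of the lecture:

```python
import random

class DummyEnv:
    """Stand-in environment (an assumption for illustration): the episode
    ends after 10 steps and rewards are random."""
    def reset(self):
        self.t = 0
        return 0.0                       # initial observation

    def step(self, action):
        self.t += 1
        reward = random.random()         # environment emits a scalar reward
        done = self.t >= 10              # episode terminates after 10 steps
        return 0.0, reward, done         # next observation, reward, done flag

class RandomAgent:
    """Stand-in agent: picks one of two hypothetical actions at random."""
    def act(self, observation):
        return random.choice([0, 1])

def run_episode(env, agent):
    """The loop from the slide: the agent executes an action, the
    environment returns an observation and a scalar reward."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(obs)                # agent acts
        obs, reward, done = env.step(action)   # environment responds
        total_reward += reward                 # cumulative reward to maximize
    return total_reward

print(run_episode(DummyEnv(), RandomAgent()))
```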

  7. Motor Control and Robotics
     Robotics:
     • Observations: camera images, joint angles
     • Actions: joint torques
     • Rewards: stay balanced, navigate to target locations, serve and protect humans

  8. Business Operations
     Inventory Management:
     • Observations: current inventory levels
     • Actions: number of units of each item to purchase
     • Rewards: profit

  9. Image Captioning
     Hard Attention for Image Captioning:
     • Observations: current image window
     • Actions: where to look
     • Rewards: classification accuracy

  10. Why is Go hard for computers to play?
     Game tree complexity = b^d
     Brute force search intractable:
     1. Search space is huge
     2. “Impossible” for computers to evaluate who is winning
     Games are a different kind of optimization problem (min-max), but still considered to be RL:
     • Go (complete information, deterministic) – AlphaGo
     • Backgammon (complete information, stochastic) – TD-Gammon
     • Stratego (incomplete information, deterministic)
     • Poker (incomplete information, stochastic)
     Matej Moravčík et al. “DeepStack: Expert-level artificial intelligence in heads-up no-limit poker”. Science, 2 Mar 2017.
     David Silver, Aja Huang, et al. “Mastering the game of Go with deep neural networks and tree search”. Nature 529.7587 (2016), pp. 484–489.
     Gerald Tesauro. “Temporal difference learning and TD-Gammon”. Communications of the ACM 38.3 (1995), pp. 58–68.
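For a rough sense of what b^d means here, a quick back-of-the-envelope calculation with commonly cited approximate figures for branching factor and game depth (assumptions, not exact counts):

```python
import math

# Rough, commonly cited estimates (assumptions): branching factor b, depth d
go_b, go_d = 250, 150
chess_b, chess_d = 35, 80

# Game tree complexity ~ b^d; report the order of magnitude via log10
print(f"Go:    ~10^{go_d * math.log10(go_b):.0f} game tree nodes")     # ~10^360
print(f"Chess: ~10^{chess_d * math.log10(chess_b):.0f} game tree nodes")  # ~10^124
```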

  11. How does RL relate to Supervised Learning?
     • Supervised learning:
     − Environment samples input-output pair (x_t, y_t) ∼ ρ
     − Agent predicts ŷ_t = f(x_t)
     − Agent receives loss ℓ(y_t, ŷ_t)
     − Environment asks agent a question, and then tells her the right answer

  12. How does RL relate to Supervised Learning?
     • Reinforcement learning:
     − Environment samples input x_t ∼ P(x_t | x_{t−1}, y_{t−1})
       § Input depends on your previous actions!
     − Agent predicts ŷ_t = f(x_t)
     − Agent receives cost c_t ∼ P(c_t | x_t, ŷ_t), where P is a probability distribution unknown to the agent

  13. Reinforcement Learning in a nutshell
     RL is a general-purpose framework for decision-making:
     • RL is for an agent with the capacity to act
     • Each action influences the agent’s future state
     • Success is measured by a scalar reward signal
     • Goal: select actions to maximize future reward

  14. Deep Learning in a nutshell
     DL is a general-purpose framework for representation learning:
     • Given an objective
     • Learn representation that is required to achieve objective
     • Directly from raw inputs
     • Using minimal domain knowledge

  15. Deep Reinforcement Learning: AI = RL + DL
     We seek a single agent which can solve any human-level task:
     • RL defines the objective
     • DL gives the mechanism
     • RL + DL = general intelligence
     • Examples:
     − Play games: Atari, poker, Go, ...
     − Explore worlds: 3D worlds, Labyrinth, ...
     − Control physical systems: manipulate, walk, swim, ...
     − Interact with users: recommend, optimize, personalize, ...

  16. Agent and Environment
     • At each step t the agent:
     − Executes action a_t
     − Receives observation o_t
     − Receives scalar reward r_t
     • The environment:
     − Receives action a_t
     − Emits observation o_{t+1}
     − Emits scalar reward r_{t+1}

  17. Example Reinforcement Learning Problem
     An agent operates in an environment: Atari Breakout
     • An agent has the capacity to act
     • Each action influences the agent’s future state
     • Success is measured by a reward signal
     • Goal is to select actions to maximize future reward

  18. State
     • Experience is a sequence of observations, actions, rewards: o_1, r_1, a_1, ..., a_{t−1}, o_t, r_t
     • The state is a summary of experience: s_t = f(o_1, r_1, a_1, ..., a_{t−1}, o_t, r_t)
     • In a fully observed environment: s_t = f(o_t)

  19. Major Components of an RL Agent
     • An RL agent may include one or more of these components:
     − Policy: Agent’s behavior function
     − Value function: How good is each state and/or action
     − Model: Agent’s representation of the environment

  20. Policy
     • A policy is the agent’s behavior
     • It is a map from state to action:
     − Deterministic policy: a = π(s)
     − Stochastic policy: π(a|s) = P[a|s]
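A toy illustration of the two kinds of policy; the integer states, binary actions, and probabilities below are made up for the sketch:

```python
import random

# Deterministic policy: a = pi(s). Always returns the same action for a
# given state.
def deterministic_policy(state):
    return 1 if state > 0 else 0

# Stochastic policy: pi(a|s) = P[a|s]. Samples an action from a
# state-dependent distribution over actions (probabilities are hypothetical).
def stochastic_policy(state):
    p_action_1 = 0.9 if state > 0 else 0.1
    return 1 if random.random() < p_action_1 else 0

print(deterministic_policy(3))  # always 1 for positive states
print(stochastic_policy(3))     # 1 with probability 0.9
```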

  21. Value Function
     • A value function is a prediction of future reward
     − “How much reward will I get from action a in state s?”
     • Q-value function gives expected total reward
     − from state s and action a
     − under policy π
     − with discount factor γ
     Q^π(s, a) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s, a ]
     • Value functions decompose into a Bellman equation
     Q^π(s, a) = E_{s', a'}[ r + γ Q^π(s', a') | s, a ]
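The expected total reward above is just a discounted sum. A minimal sketch computing it for one observed reward sequence (the rewards and γ are toy numbers):

```python
def discounted_return(rewards, gamma=0.99):
    """Total discounted reward: r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: three steps of reward with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```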

  22. Optimal Value Functions
     • An optimal value function is the maximum achievable value
     Q*(s, a) = max_π Q^π(s, a) = Q^{π*}(s, a)
     • Once we have Q* we can act optimally:
     π*(s) = argmax_a Q*(s, a)
     • Optimal value maximizes over all decisions. Informally:
     Q*(s, a) = r_{t+1} + γ max_{a_{t+1}} r_{t+2} + γ² max_{a_{t+2}} r_{t+3} + ...
              = r_{t+1} + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1})
     • Formally, optimal values decompose into a Bellman equation
     Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
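One way to see the Bellman optimality equation at work is tabular Q-value iteration, which repeatedly applies the backup until Q converges to Q*. A minimal sketch on a tiny made-up two-state, two-action deterministic MDP (all transitions and rewards are illustrative assumptions):

```python
# Toy deterministic MDP: next_state[s][a] and reward[s][a] define the
# dynamics. Action 1 always pays reward 1 and leads to state 1.
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
reward     = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}
gamma = 0.9

Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
for _ in range(100):  # repeatedly apply the Bellman optimality backup
    Q = {s: {a: reward[s][a] + gamma * max(Q[next_state[s][a]].values())
             for a in (0, 1)}
         for s in (0, 1)}

# Greedy (optimal) policy: pi*(s) = argmax_a Q*(s, a)
pi_star = {s: max(Q[s], key=Q[s].get) for s in Q}
print(Q)        # Q*(s, 1) -> 10 = 1/(1 - gamma), Q*(s, 0) -> 9
print(pi_star)  # action 1 in both states
```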

  23. Model
     [Figure: agent-environment loop annotated with the observations o_t, actions a_t, and rewards r_t from which the agent builds its model]

  24. Model
     • Model is learnt from experience
     • Acts as proxy for environment
     • Planner interacts with model
     • e.g. using lookahead search

  25. Approaches To Reinforcement Learning
     • Value-based RL
     − Estimate the optimal value function Q*(s, a)
     − This is the maximum value achievable under any policy
     • Policy-based RL
     − Search directly for the optimal policy π*
     − This is the policy achieving maximum future reward
     • Model-based RL
     − Build a model of the environment
     − Plan (e.g. by lookahead) using model

  26. Deep Reinforcement Learning
     • Use deep neural networks to represent
     − Value function
     − Policy
     − Model
     • Optimize loss function by stochastic gradient descent

  27. Value-Based Deep RL

  28. Q-Networks
     • Represent value function by Q-network with weights w: Q(s, a, w) ≈ Q*(s, a)
     [Figure: two architectures; one network takes state s and action a as input and outputs a single value Q(s, a, w), the other takes only s and outputs one value per action, Q(s, a_1, w), ..., Q(s, a_m, w)]
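A minimal sketch of the second architecture (state in, one Q-value per action out), assuming PyTorch; the layer sizes and state/action dimensions are placeholder choices, not the lecture's:

```python
import torch
import torch.nn as nn

# Made-up dimensions for illustration
STATE_DIM, NUM_ACTIONS = 4, 2

# Q-network: maps a state s to the vector (Q(s, a_1, w), ..., Q(s, a_m, w))
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),  # one output head per discrete action
)

s = torch.randn(1, STATE_DIM)           # a batch with a single state
q_values = q_net(s)                     # shape (1, NUM_ACTIONS)
greedy_action = q_values.argmax(dim=1)  # act greedily w.r.t. Q(s, ., w)
print(q_values, greedy_action)
```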

  29. Q-Learning
     • Optimal Q-values should obey the Bellman equation
     Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]
     • Treat the right-hand side r + γ max_{a'} Q(s', a', w) as a target
     • Minimize MSE loss by stochastic gradient descent
     l = ( r + γ max_{a'} Q(s', a', w) − Q(s, a, w) )²
     • Converges to Q* using table lookup representation
     • But diverges using neural networks due to:
     − Correlations between samples
     − Non-stationary targets
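Putting the target and the MSE loss together, here is a hedged sketch of one Q-learning SGD step, again assuming PyTorch and toy batch data; the detach() call implements "treat the right-hand side as a fixed target":

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99  # placeholder dimensions
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

# A toy batch of 8 transitions (s, a, r, s')
s  = torch.randn(8, STATE_DIM)            # states
a  = torch.randint(0, NUM_ACTIONS, (8,))  # actions taken
r  = torch.randn(8)                       # rewards
s2 = torch.randn(8, STATE_DIM)            # next states

# Target: r + gamma * max_a' Q(s', a', w); detach() stops gradients from
# flowing through the target
target = r + GAMMA * q_net(s2).max(dim=1).values.detach()
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a, w)
loss = ((target - q_sa) ** 2).mean()                  # MSE loss l

optimizer.zero_grad()
loss.backward()
optimizer.step()
```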

  30. Deep Q-Networks (DQN): Experience Replay
     • To remove correlations, build a data-set of transitions (s, a, r, s') from the agent’s own experience:
     s_1, a_1, r_2, s_2
     s_2, a_2, r_3, s_3
     s_3, a_3, r_4, s_4
     ...
     s_t, a_t, r_{t+1}, s_{t+1}
     • Sample experiences from the data-set and apply update
     l = ( r + γ max_{a'} Q(s', a', w⁻) − Q(s, a, w) )²
     • To deal with non-stationarity, target parameters w⁻ are held fixed
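A minimal sketch of an experience-replay buffer; the capacity and the uniform-sampling strategy are illustrative choices. The comment at the end shows one common way (an assumption, not spelled out on the slide) to refresh the fixed target parameters w⁻:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s') transitions; sampling uniformly at random
    breaks the correlations between consecutive experiences."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer()
for t in range(1000):
    buf.add(s=t, a=0, r=1.0, s_next=t + 1)  # dummy transitions
batch = buf.sample(32)                      # decorrelated minibatch

# Target network: a periodically synchronized copy whose weights w- stay
# fixed between syncs, keeping the regression target stationary. With
# PyTorch modules this could be (illustrative):
#   target_net.load_state_dict(q_net.state_dict())  # every N steps
```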

  31. Deep Reinforcement Learning in Atari
     [Figure: the agent observes the game screen as state s_t, takes action a_t, and receives reward r_t]
     Human-level control through deep reinforcement learning, V. Mnih et al. Nature 518:529–533, 2015.

  32. DQN in Atari
     • End-to-end learning of values Q(s, a) from pixels s
     • Input state s is a stack of raw pixels from the last 4 frames
     • Output is Q(s, a) for 18 joystick/button positions
     • Reward is the change in score for that step
     • Network architecture and hyperparameters fixed across all games
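A small sketch of how the input state can be assembled from the last 4 frames; the 84×84 frame size matches the DQN paper, while the zero frames below are placeholders for real preprocessed screens:

```python
from collections import deque
import numpy as np

# Keep only the 4 most recent preprocessed frames
frames = deque(maxlen=4)
for _ in range(4):
    frames.append(np.zeros((84, 84), dtype=np.float32))  # dummy frames

def make_state(frames):
    """Stack the last 4 frames into one network input s."""
    return np.stack(frames, axis=0)  # shape (4, 84, 84)

state = make_state(frames)
print(state.shape)  # (4, 84, 84)
```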

  33. DQN Results in Atari

