
Outline
• General introduction
• Basic settings
• Tabular approach
• Deep reinforcement learning
• Challenges and opportunities
• Appendix: selected applications

General Introduction


  1. Recall: Bellman Optimality Equation
  • Optimal value functions
    • Optimal state-value function: $v_*(s) = \max_\pi v_\pi(s)$
    • Optimal action-value function: $q_*(s, a) = \max_\pi q_\pi(s, a)$
  • Bellman optimality equation
    • $v_*(s) = \max_a q_*(s, a)$
    • $q_*(s, a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a v_*(s')$

  2. Planning (Optimal Control)
  • Given an exact model (i.e., reward function, transition probabilities)
  • Value iteration with the Bellman optimality equation:
    • Arbitrary initialization: $q_0$
    • For $k = 0, 1, 2, \dots$: $\forall s \in S, a \in A$,
      $q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a'} q_k(s', a')$
    • Stopping criterion: $\max_{s \in S, a \in A} \left| q_{k+1}(s, a) - q_k(s, a) \right| \le \epsilon$
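To make the value-iteration loop concrete, here is a minimal NumPy sketch; the array names `R[s, a]` (expected reward) and `P[s, a, s']` (transition probabilities) and the hyper-parameters are illustrative assumptions, not notation from the slides.

```python
# Minimal value-iteration sketch for a finite MDP with a known model.
# Assumed inputs: R with shape (n_states, n_actions), P with shape
# (n_states, n_actions, n_states) where P[s, a, s'] = P(s' | s, a).
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    n_states, n_actions = R.shape
    q = np.zeros((n_states, n_actions))            # arbitrary initialization q_0
    while True:
        # Bellman optimality backup: q_{k+1}(s,a) = R(s,a) + gamma * E[max_a' q_k(s',a')]
        q_next = R + gamma * P @ q.max(axis=1)
        if np.max(np.abs(q_next - q)) <= eps:      # stopping criterion
            return q_next
        q = q_next
```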

  3. Learning in MDPs
  • Have access to the real system but no model
  • Generate experience $s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t$
  • Two kinds of approaches
    • Model-free learning
    • Model-based learning

  4. Monte-Carlo Policy Evaluation
  • To evaluate state $s$:
    • The first time-step $t$ that state $s$ is visited in an episode,
      • Increment counter $N(s) \leftarrow N(s) + 1$
      • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by the mean return $V(s) = S(s) / N(s)$
    • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

  5. Incremental Monte-Carlo Update
  • Incremental mean: $\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \frac{1}{k}\Big(x_k + \sum_{j=1}^{k-1} x_j\Big) = \frac{1}{k}\big(x_k + (k-1)\mu_{k-1}\big) = \mu_{k-1} + \frac{1}{k}\big(x_k - \mu_{k-1}\big)$
  • For each state $s$ with return $G_t$:
    • $N(s) \leftarrow N(s) + 1$
    • $V(s) \leftarrow V(s) + \frac{1}{N(s)}\big(G_t - V(s)\big)$
  • Handle non-stationary problems: $V(s) \leftarrow V(s) + \alpha\big(G_t - V(s)\big)$
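A small sketch of first-visit Monte-Carlo evaluation using this incremental mean; the episode format (a list of `(state, reward)` pairs) and the function name are assumptions chosen for illustration.

```python
# First-visit Monte-Carlo evaluation with the incremental mean update.
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.99):
    N = defaultdict(int)      # visit counts N(s)
    V = defaultdict(float)    # value estimates V(s)
    for episode in episodes:  # each episode: [(state, reward), ...]
        # Compute returns G_t backwards from the end of the episode.
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:              # first visit only
                continue
            seen.add(state)
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # V(s) <- V(s) + (G_t - V(s)) / N(s)
    return V
```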

  6. Monte-Carlo Policy Evaluation
  • $V(s_t) \leftarrow V(s_t) + \alpha\big(G_t - V(s_t)\big)$
  • $G_t$ is the actual long-term return following state $s_t$ in a sampled trajectory

  7. Monte-Carlo Reinforcement Learning
  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards
  • MC learns from complete episodes
    • Values for each state or state-action pair are updated based only on the final return, not on estimates of neighboring states
  • MC uses the simplest possible idea: value = mean return
  • Caveat: MC can only be applied to episodic MDPs
    • All episodes must terminate

  8. Temporal-Difference Policy Evaluation
  • Monte-Carlo: $V(s_t) \leftarrow V(s_t) + \alpha\big(G_t - V(s_t)\big)$
  • TD: $V(s_t) \leftarrow V(s_t) + \alpha\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big)$
  • $r_{t+1}$ is the actual immediate reward following state $s_t$ in a sampled step

  9. Temporal-Difference Policy Evaluation
  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess
  • Simplest temporal-difference learning algorithm: TD(0)
    • Update value $V(s_t)$ toward estimated return $r_{t+1} + \gamma V(s_{t+1})$:
      $V(s_t) \leftarrow V(s_t) + \alpha\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big)$
    • $r_{t+1} + \gamma V(s_{t+1})$ is called the TD target
    • $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error
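A minimal TD(0) evaluation sketch; the `env.reset()` / `env.step(action) -> (next_state, reward, done)` convention, the `policy(state)` callable, and the hyper-parameters are assumptions, not prescribed by the slides.

```python
# Tabular TD(0) policy evaluation: bootstrap from the current estimate V(s_{t+1}).
from collections import defaultdict

def td0_evaluate(env, policy, n_episodes=1000, alpha=0.1, gamma=0.99):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            td_target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (td_target - V[state])   # move V(s) toward the TD target
            state = next_state
    return V
```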

  10. Comparisons among TD, MC, and DP

  11. Policy Improvement

  12. Policy Iteration

  13. ε-greedy Exploration
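Although the slide gives only the title, ε-greedy selection is simple enough to sketch directly; the tabular layout `Q[(state, action)]` is an illustrative assumption.

```python
# Epsilon-greedy action selection over tabular Q-values.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit greedily
```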

  14. Monte-Carlo Policy Iteration

  15. Monte-Carlo Control

  16. MC vs TD Control
  • Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
    • Lower variance
    • Online
    • Incomplete sequences
  • Natural idea: use TD instead of MC in our control loop
    • Apply TD to $Q(S, A)$
    • Use ε-greedy policy improvement
    • Update every time-step
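The control loop described here corresponds to SARSA-style TD control; the sketch below reuses the hypothetical `env` convention and `epsilon_greedy` helper from the earlier examples.

```python
# SARSA-style TD control: epsilon-greedy improvement, update every time-step.
from collections import defaultdict

def td_control(env, actions, n_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```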

  17. Model-based Learning
  • Use experience data to estimate a model
  • Compute the optimal policy w.r.t. the estimated model

  18. Summary of RL
  • Planning
    • Policy evaluation (for a fixed policy)
    • Optimal control (optimize the policy): value iteration, policy iteration
  • Model-free learning
    • Policy evaluation (for a fixed policy): Monte-Carlo, TD learning
    • Optimal control (optimize the policy)
  • Model-based learning

  19. Large Scale RL
  • So far we have represented the value function by a lookup table
    • Every state $s$ has an entry $v(s)$
    • Or every state-action pair $(s, a)$ has an entry $q(s, a)$
  • Problem with large MDPs:
    • Too many states and/or actions to store in memory
    • Too slow to learn the value of each state (or state-action pair) individually
    • Backgammon: $10^{20}$ states
    • Go: $10^{170}$ states

  20. Solution: Function Approximation for RL
  • Estimate the value function with function approximation
    • $\hat{v}(s; \theta) \approx v_\pi(s)$ or $\hat{q}(s, a; \theta) \approx q_\pi(s, a)$
    • Generalize from seen states to unseen states
    • Update parameter $\theta$ using MC or TD learning
  • Policy function
  • Model transition function
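A semi-gradient TD(0) sketch with a linear approximator $\hat{v}(s;\theta) = \theta^\top \phi(s)$; the feature map `phi`, the `env`/`policy` conventions, and the step sizes are assumptions chosen for illustration.

```python
# Semi-gradient TD(0) with a linear value-function approximator.
import numpy as np

def semi_gradient_td0(env, policy, phi, n_features,
                      n_episodes=1000, alpha=0.01, gamma=0.99):
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            v_s = theta @ phi(state)
            v_next = 0.0 if done else theta @ phi(next_state)
            td_error = reward + gamma * v_next - v_s
            theta += alpha * td_error * phi(state)   # phi(s) is the gradient of v_hat w.r.t. theta
            state = next_state
    return theta
```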

  21. Deep Reinforcement Learning
  • Deep learning
  • Value based
  • Policy gradients
  • Actor-critic
  • Model based

  22. Deep Learning Is Making Breakthroughs!
  • In closed tests restricted to fixed image categories, AI techniques have already reached or surpassed human-level performance.
  • In October 2016, Microsoft's speech recognition system reached a 5.9% word error rate on everyday conversational data, matching human-level recognition accuracy for the first time.

  23. Deep Learning
  • Deep learning (deep machine learning, deep structured learning, hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.
  • Timeline:
    • 1958: Birth of the Perceptron and neural networks
    • 1974: Backpropagation
    • Late 1980s: Convolutional neural networks (CNN) and recurrent neural networks (RNN) trained using backpropagation
    • 1997: LSTM-RNN
    • 2006: Unsupervised pretraining for deep neural networks
    • 2012: Distributed deep learning (e.g., Google Brain)
    • 2013: DQN for deep reinforcement learning
    • 2015: Open-source tools: MxNet, TensorFlow, CNTK

  24. Driving Power
  • Big data: web pages, search logs, social networks, and new mechanisms for data collection: conversation and crowdsourcing
  • Deep models: 1000+ layers, tens of billions of parameters
  • Big computer clusters: CPU clusters, GPU clusters, FPGA farms, provided by Amazon, Azure, etc.

  25. Value based methods: estimate value function or Q-function of the optimal policy (no explicit policy)

  26. Human-Level Control Through Deep Reinforcement Learning (Nature, 2015)

  27. Human-level Control Through Deep Reinforcement Learning: Representations of Atari Games
  • End-to-end learning of values $Q(s, a)$ from pixels $s$
  • Input state $s$ is a stack of raw pixels from the last 4 frames
  • Output is $Q(s, a)$ for 18 joystick/button positions
  • Reward is the change in score for that step

  28. Value Iteration with Q-Learning
  • Represent the value function by a deep Q-network with weights $\theta$: $Q(s, a; \theta) \approx Q_\pi(s, a)$
  • Define the objective function by the mean-squared error in Q-values:
    $L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^2\Big]$
  • Leading to the following Q-learning gradient:
    $\frac{\partial L(\theta)}{\partial \theta} = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)\frac{\partial Q(s, a; \theta)}{\partial \theta}\Big]$
  • Optimize the objective end-to-end by SGD
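A hedged PyTorch-style sketch of one gradient step on this objective; `q_net` (a network mapping a state batch to per-action Q-values), the batch layout, and the optimizer are assumed names, and this simplified version bootstraps from the online network itself rather than the separate target network introduced a few slides later.

```python
# One Q-learning gradient step on a mini-batch, mean-squared TD error as the loss.
import torch
import torch.nn.functional as F

def q_learning_step(q_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch          # tensors; dones in {0, 1}
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                         # targets are treated as constants
        target = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()                                               # end-to-end gradient through Q(s, a; theta)
    optimizer.step()
    return loss.item()
```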

  29. Stability Issues with Deep RL
  Naive Q-learning oscillates or diverges with neural nets:
  • Data is sequential
    • Successive samples are correlated, non-i.i.d.
  • Policy changes rapidly with slight changes to Q-values
    • Policy may oscillate
    • Distribution of data can swing from one extreme to another

  30. Deep Q-Networks
  • DQN provides a stable solution to deep value-based RL
  • Use experience replay
    • Break correlations in data, bring us back to the i.i.d. setting
    • Learn from all past policies
    • Using off-policy Q-learning
  • Freeze the target Q-network
    • Avoid oscillations
    • Break correlations between the Q-network and the target

  31. Deep Q-Networks: Experience Replay
  To remove correlations, build a data set from the agent's own experience:
  • Take action $a_t$ according to an ε-greedy policy
  • Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $D$
  • Sample a random mini-batch of transitions $(s, a, r, s')$ from $D$
  • Optimize the MSE between the Q-network and the Q-learning targets, e.g.
    $L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^2\Big]$
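A minimal replay-memory sketch; the class and method names are illustrative, not taken from the paper.

```python
# Fixed-capacity replay memory: store transitions, sample uncorrelated mini-batches.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are dropped

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```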

  32. Deep Q-Networks: Fixed Target Network
  To avoid oscillations, fix the parameters used in the Q-learning target:
  • Compute the Q-learning targets w.r.t. old, fixed parameters $\theta^-$:
    $r + \gamma \max_{a'} Q(s', a'; \theta^-)$
  • Optimize the MSE between the Q-network and the Q-learning targets:
    $L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$
  • Periodically update the fixed parameters: $\theta^- \leftarrow \theta$
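A short sketch of the periodic $\theta^- \leftarrow \theta$ synchronization in PyTorch terms; `q_net` and `target_net` are assumed to be two networks with identical architecture, and `sync_every` is an illustrative hyper-parameter.

```python
# Periodically copy the online parameters theta into the frozen target theta^-.
def maybe_sync_target(step, q_net, target_net, sync_every=10_000):
    """Call once per environment step; Q-learning targets use target_net in between."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())   # theta^- <- theta
```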

  33. Experiments
  • Of the 49 Atari games, DQN beats state-of-the-art results on 43 games
  • It achieves 75% of the expert score on 29 games

  34. Other Tricks
  • DQN clips the rewards to [-1, +1]
    • This prevents Q-values from becoming too large
    • Ensures gradients are well-conditioned
    • But it can't tell the difference between small and large rewards
  • Better approach: normalize the network output, e.g. via batch normalization

  35. Extensions
  • Deep Recurrent Q-Learning for Partially Observable MDPs
    • Use CNN + LSTM instead of CNN to encode frames of images
  • Deep Attention Recurrent Q-Network
    • Use CNN + LSTM + Attention model to encode frames of images

  36. Policy gradients: directly differentiate the objective

  37. Gradient Computation

  38. Policy Gradients
  • Optimization problem: find $\theta$ that maximizes the expected total reward
  • The gradient of a stochastic policy $\pi_\theta(a \mid s)$ is given by $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big]$
  • The gradient of a deterministic policy $a = \mu_\theta(s)$ is given by $\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)\big|_{a = \mu_\theta(s)}\big]$
  • The gradient tries to
    • Increase the probability of paths with positive R
    • Decrease the probability of paths with negative R

  39. REINFORCE
  • We use the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$
    • $v_t = r_{t+1} + r_{t+2} + \cdots + r_T$
  • Drawbacks
    • High variance
    • Limited to the stochastic case
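A hedged PyTorch-style REINFORCE sketch for a discrete action space; `policy_net` (mapping a state to action logits), the `env` convention, and the use of discounted returns are assumptions for illustration.

```python
# One REINFORCE update: run an episode, then ascend E[ sum_t log pi(a_t|s_t) * G_t ].
import torch

def reinforce_episode(env, policy_net, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    # Discounted returns G_t, computed backwards; they stand in for Q(s_t, a_t).
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```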

  40. Actor-critic: estimate value function or Q-function of the current policy, use it to improve policy

  41. Actor-Critic
  • We use a critic to estimate the action-value function
  • Actor-critic algorithms
    • Update the action-value function parameters
    • Update the policy parameters $\theta$, in the direction suggested by the critic
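A hedged one-step actor-critic sketch in PyTorch terms; `critic`, the two optimizers, and `action_logprob` (the log-probability of the taken action under the actor network, with gradients attached) are assumed names. The critic's TD error both trains the value function and weights the actor's log-probability, mirroring the two updates listed above.

```python
# One-step actor-critic update: the critic's TD error drives both updates.
import torch

def actor_critic_step(actor_opt, critic, critic_opt, action_logprob,
                      state, reward, next_state, done, gamma=0.99):
    v_s = critic(state)
    with torch.no_grad():
        v_next = torch.zeros_like(v_s) if done else critic(next_state)
        td_target = reward + gamma * v_next
    td_error = td_target - v_s
    critic_loss = td_error.pow(2).mean()                          # update value-function parameters
    actor_loss = -(action_logprob * td_error.detach().mean())     # direction suggested by the critic
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```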

  42. Review
  • Value Based
    • Learnt Value Function
    • Implicit policy (e.g. ε-greedy)
  • Policy Based
    • No Value Function
    • Learnt Policy
  • Actor-Critic
    • Learnt Value Function
    • Learnt Policy

  43. Model based DRL
  • Learn a transition model of the environment/system $P(r, s' \mid s, a)$
    • Use a deep network to represent the model
    • Define a loss function for the model
    • Optimize the loss by SGD or its variants
  • Plan using the transition model
    • E.g., lookahead using the transition model to find optimal actions
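A small sketch of the planning step with a learned model; `model(state, action) -> (predicted_next_state, predicted_reward)` and `value_fn` are hypothetical interfaces standing in for the deep transition model above, and the lookahead shown is a single greedy step rather than a full planner.

```python
# Greedy one-step lookahead with a learned transition/reward model.
def plan_one_step(model, value_fn, state, actions, gamma=0.99):
    def lookahead_score(action):
        next_state, reward = model(state, action)    # model prediction, not the real environment
        return reward + gamma * value_fn(next_state)
    return max(actions, key=lookahead_score)
```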

  44. Model based DRL: Challenges
  • Errors in the transition model compound over the trajectory
  • By the end of a long trajectory, rewards can be totally wrong
  • Model-based RL has failed in Atari

  45. Challenges and Opportunities

  46. 1. Robustness – random seeds

  47. 1. Robustness – random seeds (Deep Reinforcement Learning that Matters, AAAI'18)

  48. 2. Robustness – across tasks (Deep Reinforcement Learning that Matters, AAAI'18)

  49. As a Comparison
  • ResNet performs pretty well on various kinds of tasks
    • Object detection
    • Image segmentation
    • Go playing
    • Image generation
    • …

  50. 3. Learning – sample efficiency
  • Supervised learning: learning from an oracle
  • Reinforcement learning: learning from trial and error
  Rainbow: Combining Improvements in Deep Reinforcement Learning

  51. Multi-task/transfer learning
  • Humans can't learn individual complex tasks from scratch.
  • Maybe our agents shouldn't either.
  • We ultimately want our agents to learn many tasks in many environments
    • Learn to learn new tasks quickly (Duan et al. '17, Wang et al. '17, Finn et al. ICML '17)
    • Share information across tasks in other ways (Rusu et al. NIPS '16, Andrychowicz et al. '17, Cabi et al. '17, Teh et al. '17)
    • Better exploration strategies

  52. 4. Optimization – local optima

  53. 5. No/sparse reward
  Real-world interaction:
  • Usually no (visible) immediate reward for each action
  • Maybe no (visible) explicit final reward for a sequence of actions
  • Don't know how to terminate a sequence
  Consequences:
  • Most DRL algorithms are for games or robotics
    • Reward information is defined by video games in Atari and Go
    • Within controlled environments

  54. • Scalar reward is an extremely sparse signal, while at the same time humans can learn without any external rewards.
  • Self-supervision (Osband et al. NIPS '16, Houthooft et al. NIPS '16, Pathak et al. ICML '17, Fu*, Co-Reyes* et al. '17, Tang et al. ICLR '17, Plappert et al. '17)
  • Options & hierarchy (Kulkarni et al. NIPS '16, Vezhnevets et al. NIPS '16, Bacon et al. AAAI '16, Heess et al. '17, Vezhnevets et al. ICML '17, Tessler et al. AAAI '17)
  • Leveraging stochastic policies for better exploration (Florensa et al. ICLR '17, Haarnoja et al. ICML '17)
  • Auxiliary objectives (Jaderberg et al. '17, Shelhamer et al. '17, Mirowski et al. ICLR '17)

  55. 6. Is DRL a good choice for a task?

  56. 7. Imperfect-information games and multi-agent games
  • No-limit heads-up Texas Hold'Em
    • Libratus (Brown et al., NIPS 2017)
    • DeepStack (Moravčík et al., 2017)
  • Refer to Prof. Bo An's talk

  57. Opportunities
  • Improve robustness (e.g., w.r.t. random seeds and across tasks)
  • Improve learning efficiency
  • Better optimization
  • Define reward in practical applications
  • Identify appropriate tasks
  • Imperfect information and multi-agent games

  58. Applications

  59. Game Neuro Science Music & Movie Healthcare NLP Trading Robotics Education Control

  60. Game
  • RL for Games
    • Sequential decision making
    • Delayed reward
  • Examples: TD-Gammon, Atari Games

  61. Game
  • Atari Games
    • Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone
    • Learned to play better than all previous algorithms and at human level for more than half the games
  Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518(7540): 529-533.

  62. Game
  • AlphaGo: 4-1
  • Master (AlphaGo++): 60-0
  • CNN-based Value Network and Policy Network
  http://icml.cc/2016/tutorials/AlphaGo-tutorial-slides.pdf

  63. Game Neuro Science Music & Movie Healthcare NLP Trading Robotics Education Control

  64. Neuro Science The world presents animals/humans with a huge reinforcement learning problem (or many such small problems)

  65. Neuro Science
  • How can the brain realize these? Can RL help us understand the brain's computations?
  • Reinforcement learning has revolutionized our understanding of learning in the brain in the last 20 years.
  • A success story: dopamine and prediction errors
  Yael Niv. The Neuroscience of Reinforcement Learning. Princeton University. ICML'09 Tutorial

  66. What is dopamine?
  • Parkinson's Disease
  • Plays a major role in reward-motivated behavior as a "global reward signal"
  • Gambling
  • Regulating attention
  • Pleasure

  67. Conditioning
  • Pavlov's Dog
