SLIDE 1

Deep he(a)p, big feat

arXiv:1707.06887, A Distributional Perspective on Reinforcement Learning
arXiv:1702.08165, Reinforcement Learning with Deep Energy-Based Policies

1 / 25

SLIDE 2

Reinforcement Learning

[Diagram: the agent-environment loop. The agent sends an action to the environment; an interpreter feeds the resulting reward and state back to the agent.]

2 / 25

SLIDE 3

Reinforcement Learning

A framework for modeling intelligent agents: an agent takes actions, depending on its state, that change the environment, with the goal of maximizing its reward.

3 / 25

SLIDE 4

Reinforcement Learning

◮ bandits / Markov decision process (MDP)
◮ episodes and discounts
◮ model-based RL / model-free RL
◮ single-agent / multi-agent
◮ tabular RL / Deep RL (parameterized policies)
◮ discrete / continuous
◮ on-policy / off-policy learning
◮ policy gradients / Q-learning

4 / 25

SLIDE 5

Markov decision process (MDP)

[Diagram: a small example MDP with states S0, S1, S2 and actions a0, a1 moving between them.]

5 / 25

SLIDE 6

Markov decision process (MDP)

◮ states s ∈ S
◮ actions a ∈ A
◮ transition probability p(s′|s, a)
◮ rewards r(s), r(s, a), or r(s, a, s′)

It’s Markov because the transition s_t → s_{t+1} depends only on s_t. It’s a decision process because it also depends on the action a_t. The goal is to find a policy π(a|s) that maximizes reward over time.
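A minimal way to hold a tabular MDP in code is a nested mapping from (state, action) to outcome probabilities and rewards. The following Python sketch is purely illustrative (the state/action names and numbers are made up, not from the slides):

import numpy as np

rng = np.random.default_rng()

# Hypothetical 2-state, 2-action MDP: p(s'|s, a) and r(s, a).
transitions = {  # transitions[s][a] = {s_next: probability}
    "s0": {"a0": {"s0": 0.9, "s1": 0.1}, "a1": {"s1": 1.0}},
    "s1": {"a0": {"s0": 1.0},            "a1": {"s1": 1.0}},
}
rewards = {("s0", "a1"): 1.0, ("s1", "a0"): 5.0}  # reward 0 elsewhere

def step(s, a):
    """Sample s' ~ p(s'|s, a) and return (s', r(s, a))."""
    next_states = list(transitions[s][a].keys())
    probs = list(transitions[s][a].values())
    s_next = rng.choice(next_states, p=probs)
    return s_next, rewards.get((s, a), 0.0)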

6 / 25

SLIDE 7

Multi-armed bandits

7 / 25

SLIDE 8

Multi-armed bandits

[Diagram: a single state S with two arms a0, a1 and stochastic rewards r(a0), r(a1).]

Want to learn p(r|a) and maximize r. Tradeoff between exploiting and exploring.
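One standard way to trade off exploring and exploiting is an epsilon-greedy rule; here is a minimal Python sketch (the two arms' reward probabilities are made-up numbers for illustration):

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.7])       # hypothetical mean reward of each arm
q_est = np.zeros(2)                     # running estimate of E[r|a]
counts = np.zeros(2)
eps = 0.1                               # exploration rate

for t in range(10_000):
    if rng.random() < eps:              # explore: pick a random arm
        a = int(rng.integers(2))
    else:                               # exploit: pick the best current estimate
        a = int(np.argmax(q_est))
    r = rng.binomial(1, true_means[a])  # Bernoulli reward
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update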

8 / 25

SLIDE 9

Episodic RL

The agent acts until a terminal state is reached:

s_0 ∼ μ(s_0)
a_0 ∼ π(a_0|s_0), r_0 = r(s_0, a_0), s_1 ∼ p(s_1|s_0, a_0)
…
a_{T−1} ∼ π(a_{T−1}|s_{T−1}), r_{T−1} = r(s_{T−1}, a_{T−1}), s_T ∼ p(s_T|s_{T−1}, a_{T−1})

The goal is to maximize the total reward η(π) = E[r_0 + r_1 + ⋯ + r_{T−1}].
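That sampling process is just a rollout loop. A generic Python sketch, assuming a Gym-like environment whose step(a) returns (next_state, reward, done) and where policy(s) stands in for sampling a ∼ π(a|s):

def run_episode(env, policy):
    """Roll out one episode; return the total reward and the trajectory."""
    s = env.reset()                       # s_0 ~ mu(s_0)
    total, done, trajectory = 0.0, False, []
    while not done:
        a = policy(s)                     # a_t ~ pi(a_t | s_t)
        s_next, r, done = env.step(a)     # r_t and s_{t+1} ~ p(.|s_t, a_t)
        trajectory.append((s, a, r))
        total += r
        s = s_next
    return total, trajectory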

9 / 25

SLIDE 10

Discount factor

If there are no terminal states, the episode lasts “forever” and the agent takes “infinitely many” actions. In this case, we maximize the discounted total reward

η(π) = E[r_0 + γ r_1 + γ² r_2 + ⋯]

with discount γ ∈ [0, 1). Without γ,

◮ the agent has no incentive to do anything now rather than later, and
◮ η will diverge.

This means that the agent has an effective time horizon t_h ∼ 1/(1 − γ).
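A quick numeric check of the effective-horizon claim, sketched in Python: with γ = 0.99, roughly a 1 − 1/e fraction of the total discounted weight falls within the first 1/(1 − γ) ≈ 100 steps.

import numpy as np

gamma = 0.99
horizon = 1.0 / (1.0 - gamma)                # ~100 steps
weights = gamma ** np.arange(1000)           # weight of a reward t steps ahead
print(horizon)                               # 100.0
print(weights[:100].sum() / weights.sum())   # ~0.63 of the weight is within the horizon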

10 / 25

SLIDE 11

Model-based vs. Model-free

In model-based RL, we try to learn the transition function p(s′|s, a). This lets us predict the expected next state s_{t+1} given state s_t and action a_t, which means the agent can think ahead and plan future actions. In model-free RL, we either try to learn π(a|s) directly (policy gradient methods), or we learn a function Q(s, a) that tells us the value of taking action a in state s, which implies a π(a|s). In this case the agent has no “understanding” of the process and is essentially a lookup table.

11 / 25

SLIDE 12

Multi-agent RL

12 / 25

SLIDE 13

Parameterized policies / Deep RL

If the total number of states is small, then Monte Carlo or dynamic programming techniques can be used to find π(a|s) or Q(s, a). These are sometimes referred to as tabular methods. In many cases this is intractable. Instead, we need to use a function approximator, such as a neural network, to represent these functions:

π(a|s) → π(a|s, θ), Q(s, a) → Q(s, a|θ)

This takes advantage of the fact that in similar states we should take similar actions.
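As the simplest non-tabular example, here is a hypothetical sketch of Q(s, a|θ) as a linear function of hand-made state-action features; a neural network would just replace the dot product. The feature map and sizes are illustrative assumptions:

import numpy as np

N_FEATURES, N_ACTIONS = 8, 4

def features(s, a):
    """Hypothetical feature map phi(s, a): copy the state vector into the a-th block."""
    phi = np.zeros(N_FEATURES * N_ACTIONS)
    phi[a * N_FEATURES:(a + 1) * N_FEATURES] = s   # s is a length-8 feature vector
    return phi

theta = np.zeros(N_FEATURES * N_ACTIONS)           # parameters of Q(s, a | theta)

def q_value(s, a):
    return features(s, a) @ theta                  # Q(s, a | theta) = phi(s, a) . theta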

13 / 25

SLIDE 14

Discrete vs. continuous action spaces

Similarly, agents can either select from a discrete set of actions (e.g. left vs. right) or from a continuum (steer the boat to heading 136 degrees). I’m not sure why people make a big deal out of the difference.

◮ discrete: π(a|s) is a discrete probability distribution.
◮ continuous: π(a|s) is (just about always) Gaussian.
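A sketch of what sampling looks like in the two cases, assuming some model has already produced action logits (discrete) or a mean and log-std (continuous); the function names are illustrative:

import numpy as np

rng = np.random.default_rng()

def sample_discrete(logits):
    """pi(a|s) is a categorical distribution over actions."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

def sample_continuous(mean, log_std):
    """pi(a|s) is a Gaussian; the model outputs its mean and log-std."""
    return mean + np.exp(log_std) * rng.standard_normal(np.shape(mean))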

14 / 25

SLIDE 15

On-policy vs. off-policy

If our current best policy is π(a|s), do we sample from π(a|s) or do we sample from a different policy π′(a|s)?

◮ on-policy: Learn from π(a|s), then update based on what worked well / didn’t work well.
◮ off-policy: Learn from π′(a|s) but update π(a|s), letting us explore areas of state-action space that aren’t likely to come up under our policy. Can also learn from old experience.

15 / 25

SLIDE 16

Policy gradients

In which we just go for it and maximize the policy directly. Define

R[s(T), a(T)] ≡ Σ_{t=0}^{T} γ^t r(s(t), a(t))

We want to maximize η(θ) = E[R], which depends on the trajectory:

∇θη(θ) = ∇θ E[R] = ∇θ Σ_R p(R|θ) R
       = Σ_R R ∇θ p(R|θ)
       = Σ_R R p(R|θ) ∇θ log p(R|θ)
       = E[R ∇θ log p(R|θ)]

16 / 25

SLIDE 17

Policy gradients

The probability of a trajectory is

p(R|θ) = μ(s_0) ∏_{t=0}^{T−1} π(a_t|s_t, θ) p(s_{t+1}|s_t, a_t)

which means that the derivative of its log doesn’t depend on the unknown transition function. This is model-free.

∇θ log p(R|θ) = Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t)

∇θη(θ) = E[ R Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) ]

17 / 25
SLIDE 18

Policy gradients

Expressing the gradient as an expectation value means we can sample trajectories:

E[ R Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) ] → (1/N) Σ_{i=1}^{N} R_i Σ_{t=0}^{T−1} ∇θ log π(a_t^{(i)}|s_t^{(i)})

and then do gradient ascent on the policy

θ → θ + α ∇θη(θ)

Since the gradient update is derived explicitly from trajectories sampled from π(a|s), clearly this method is on-policy.
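A minimal REINFORCE-style sketch of this estimator in Python, assuming a tabular softmax policy (θ is a table of logits, so ∇θ log π(a|s) has the closed form one_hot(a) − π(·|s)); everything here is an illustrative assumption, not the talk's code:

import numpy as np

n_states, n_actions, alpha, gamma = 5, 2, 0.01, 0.99
theta = np.zeros((n_states, n_actions))          # logits of a softmax policy

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_update(trajectory):
    """trajectory: list of (s, a, r). One gradient-ascent step on eta(theta)."""
    global theta
    rewards = np.array([r for (_, _, r) in trajectory])
    R = float((gamma ** np.arange(len(rewards)) * rewards).sum())  # trajectory return
    grad = np.zeros_like(theta)
    for (s, a, _) in trajectory:
        g = -pi(s)                                # grad log pi(a|s) = one_hot(a) - pi(.|s)
        g[a] += 1.0
        grad[s] += R * g
    theta += alpha * grad                         # ascent: theta <- theta + alpha * grad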

18 / 25

SLIDE 19

Policy gradients

∇θη(θ) = E[ R Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) ]
       = E[ Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) Σ_{t′=t}^{T−1} γ^{t′−t} r(s_{t′}, a_{t′}) ]
       = E[ Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) Qπ(s_t, a_t) ]
       = E[ Σ_{t=0}^{T−1} ∇θ log π(a_t|s_t) (Qπ(s_t, a_t) − Vπ(s_t)) ]

where Qπ(s_t, a_t) ≡ Σ_{t′=t}^{T−1} γ^{t′−t} r(s_{t′}, a_{t′}) and Vπ(s_t) ≡ Σ_{a_t} Qπ(s_t, a_t) π(a_t|s_t). Subtracting the baseline Vπ(s_t) leaves the expectation unchanged but reduces the variance of the estimate.

19 / 25

SLIDE 20

Q-learning

What if we instead learn the Q-function, or state-action value function, associated with the optimal policy?

a* = arg max_a Q*(s, a)

20 / 25

SLIDE 21

Q-learning is model free

Knowing only the value function V(s) of the state for a policy isn’t enough to pick actions, because we would also need the transition function p(s′|s, a) to see which action leads to the best next state. Q(s, a) already folds the transition in, so acting greedily with respect to it requires no model.

21 / 25

SLIDE 22

Q-learning is off-policy

Expanding the definition of Q(s_t, a_t), we see

Qπ(s_t, a_t) = E[r_t + γ Vπ(s_{t+1})]
Qπ(s_t, a_t) = E[r_t + γ E[Qπ(s_{t+1}, a_{t+1})]]

This is known as temporal difference learning.

Now, let’s find the optimal Q-function:

Q*(s_t, a_t) = E[r_t + γ max_a Q*(s_{t+1}, a)]

This is Q-learning.
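The corresponding tabular update, as a minimal Python sketch (the env.step interface, table sizes, and hyperparameters are assumptions for illustration):

import numpy as np

n_states, n_actions = 10, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1
rng = np.random.default_rng()

def q_learning_step(s, env):
    """One off-policy Q-learning step: behave eps-greedily, bootstrap with max_a Q."""
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r, done = env.step(a)                  # assumed (s', r, done) interface
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])          # temporal-difference update
    return s_next, done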

22 / 25

SLIDE 23

DQN

If we have too many states, we instead minimize the loss

L(θ) = Σ_t | r_t + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Qθ(s_t, a_t) |²

via gradient descent

θ → θ − α ∇θ L(θ)
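A sketch of the loss computation for one batch in Python, with q_net and q_target as stand-in callables (in practice these would be neural networks, and no gradient would flow through the target term); the batch layout is an assumption:

import numpy as np

def dqn_loss(batch, q_net, q_target, gamma=0.99):
    """batch: dict of arrays s, a, r, s_next, done. Returns the scalar TD loss."""
    idx = np.arange(len(batch["a"]))
    q_sa = q_net(batch["s"])[idx, batch["a"]]                       # Q_theta(s_t, a_t)
    q_next = q_target(batch["s_next"]).max(axis=1)                  # max_a Q(s_{t+1}, a)
    target = batch["r"] + gamma * (1.0 - batch["done"]) * q_next    # bootstrapped target
    return np.mean((target - q_sa) ** 2)                            # squared TD error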

23 / 25

SLIDE 24

Q learning II, the SQL

Define the soft max

softmax_x f(x) ≡ log ∫ dx e^{f(x)}

Then soft Q-learning is

Q*(s_t, a_t) = E[r_t + γ softmax_a Q(s_{t+1}, a)]

which has optimal policy π(a|s) ∝ exp Q(s, a). Trade-off between optimality and entropy. Allows transfer learning by letting policies compose.
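For a discrete action set the integral becomes a sum, so the soft max is just a log-sum-exp. A small, numerically stable Python sketch of the soft max and the induced policy:

import numpy as np

def soft_max(q_values):
    """softmax_a Q(s, a) = log sum_a exp Q(s, a), computed stably."""
    m = q_values.max()
    return m + np.log(np.exp(q_values - m).sum())

def soft_policy(q_values):
    """pi(a|s) proportional to exp Q(s, a)."""
    p = np.exp(q_values - soft_max(q_values))
    return p / p.sum()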

24 / 25

SLIDE 25

A Distributional Perspective on Reinforcement Learning

Learn a distribution over Q-values. Let Z(s_t, a_t) be a random variable whose expectation is Q(s_t, a_t). Then we learn the distributional Bellman equation

Z(s_t, a_t) = r_t + γ Z(s_{t+1}, a_{t+1})

where the equality holds in distribution.
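A toy illustration of that equation in Python, representing Z(s, a) by a handful of sample atoms: applying the Bellman backup just shifts and scales the atoms of Z(s′, a′). This is a simplified stand-in for the categorical projection the paper actually uses; the numbers are made up.

import numpy as np

def distributional_backup(r, gamma, z_next_samples):
    """Samples of r_t + gamma * Z(s_{t+1}, a_{t+1}) given samples of Z(s_{t+1}, a_{t+1})."""
    return r + gamma * np.asarray(z_next_samples)

z_next = np.array([0.0, 1.0, 2.0])                     # hypothetical atoms of Z(s', a')
z_target = distributional_backup(0.5, 0.9, z_next)
print(z_target)                                        # [0.5 1.4 2.3]; its mean is the usual Q backup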

25 / 25