Unsupervised Meta-Learning for Reinforcement Learning
LAMDA, Nanjing University
November 9, 2020
Table of Contents
- Preliminary Knowledge
- An Unsupervised RL Algorithm: Diversity is All You Need
- Unsupervised Meta-Learning for Reinforcement Learning
Terminology
- task: a problem that an RL algorithm is asked to solve
- MDP = CMP + reward mechanism
- one-to-one correspondence between MDP and task
- CMP: controlled Markov process
- namely the dynamics of the environment
- consists of state space, action space, initial state distribution, transition dynamics, ...
- reward mechanism: r(s, a, s′, t)
Terminology (cont.)
- skill: a latent-conditioned policy that alters the state of the environment in a consistent way
- there is a fixed latent variable distribution p(z)
- Z ∼ p(z) is a latent variable; a policy conditioned on a fixed Z is a "skill"
- policy (skill) = parameters θ + latent variable Z
Mutual Information
- mutual information (MI) of two random variables is a measure of the mutual
dependence between the two variables
- I(x, y) = KL[p(x, y) ∥ p(x)p(y)] = −∫∫ p(x, y) ln (p(x)p(y) / p(x, y)) dx dy
- Kullback–Leibler divergence: a directed divergence between two distributions
- the larger the MI, the more p(x, y) diverges from p(x)p(y), i.e., the more dependent x and y are
- or I(x, y) = H(x) − H(x | y)
- H(y | x) = −∫∫ p(x, y) ln p(y | x) dy dx
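As a quick sanity check, MI can be computed directly from a small discrete joint distribution. A minimal Python sketch (the joint table below is made up purely for illustration):

```python
import numpy as np

# Made-up 2x2 joint distribution p(x, y), for illustration only
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal p(y)

# I(x, y) = sum_{x,y} p(x, y) * ln( p(x, y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)  # ≈ 0.193 > 0, since x and y are dependent here
```

If x and y were independent (p(x, y) = p(x)p(y)), the log term would vanish everywhere and the MI would be exactly 0.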
Table of Contents
- Preliminary Knowledge
- An Unsupervised RL Algorithm: Diversity is All You Need
- Unsupervised Meta-Learning for Reinforcement Learning
Motivation
- Autonomous acquisition of useful skills without any reward signal.
- Why without any reward signal?
- in sparse-reward settings, learning useful skills without supervision may help address challenges in exploration
- skills can serve as primitives for hierarchical RL, effectively shortening the episode length
- in many practical settings, interacting with the environment is essentially free, but evaluating the reward requires human feedback
- it is challenging to design a reward function that elicits the desired behaviors from the agent (without imitation samples, it is hard to design a reward function)
- when given an unfamiliar environment, it is challenging to determine what tasks an agent should be able to learn
Motivation (cont.)
- Autonomous acquisition of useful skills without any reward signal.
- How to define "useful skills"?
- consider the setting where the reward function is unknown, so we want to learn a set of skills by maximizing the utility of this set
- How to maximize the utility of this set?
- each skill individually is distinct
- the skills collectively explore large parts of the state space
Key Idea: Using discriminability between skills as an objective
- design a reward function which depends only on the CMP
- skills are merely distinguishable ✗
- skills are diverse in a semantically meaningful way ✓
- discriminate via action distributions ✗ (actions that do not affect the environment are not visible to an outside observer)
- discriminate via state distributions ✓
How It Works
1 skill dictates the states that the agent visits
- one-to-one correspondence between skill and Z (at any given time, the parameters θ are fixed)
- Z ∼ p(z), so different skills correspond to different values of Z
- make the state distribution depend on Z (and vice versa); then the state distributions become diverse
2 ensure that states, not actions, are used to distinguish skills
- given the state, the action should not reveal the skill
- making the action depend directly on the skill is a trivial solution that we would rather avoid
3 viewing all skills together with p(z) as a mixture of policies, we maximize the entropy H[A | S]
- Note: 2 may cause the network to ignore the input Z, but 1 prevents this; the objective may also cause the output (action) to collapse to a single action, but 3 prevents this
How It Works (cont.)
F(θ) ≜ I(S; Z) + H[A | S] − I(A; Z | S) = (H[Z] − H[Z | S]) + H[A | S] − (H[A | S] − H[A | S, Z]) = H[Z] − H[Z | S] + H[A | S, Z]
1 fix p(z) to be uniform in our approach, guaranteeing that it has maximum entropy
2 it should be easy to infer the skill z from the current state
3 each skill should act as randomly as possible
How It Works (cont.)
F(θ) = H[A | S, Z] − H[Z | S] + H[Z]
     = H[A | S, Z] + E_{z∼p(z), s∼π(z)}[log p(z | s)] − E_{z∼p(z)}[log p(z)]
     ≥ H[A | S, Z] + E_{z∼p(z), s∼π(z)}[log qφ(z | s) − log p(z)] ≜ G(θ, φ)
- G(θ, φ) is a variational lower bound on F(θ) (justification sketched below)
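A one-line justification of the inequality (it uses nothing beyond the non-negativity of the KL divergence between the true skill posterior p(z | s) and the learned discriminator qφ(z | s)):

```latex
\mathbb{E}_{z \sim p(z),\, s \sim \pi(z)}\big[\log p(z \mid s) - \log q_\phi(z \mid s)\big]
  \;=\; \mathbb{E}_{s}\big[\mathrm{KL}\big(p(\cdot \mid s)\,\|\,q_\phi(\cdot \mid s)\big)\big] \;\ge\; 0,
  \qquad\text{hence}\quad F(\theta) \;\ge\; G(\theta, \phi).
```

The bound is tight exactly when the discriminator matches the true posterior, qφ(z | s) = p(z | s).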
Implementation
[Figure: DIAYN training loop — a skill is sampled once per episode from the fixed distribution p(z); the skill-conditioned policy (learned) acts in the environment (fixed); the discriminator (learned) estimates the skill from the visited states and is updated to maximize discriminability, while the skill policy is updated to make its states more discriminable.]
- maximize a cumulative pseudo-reward with SAC
- pseudo-reward: rz(s, a) ≜ log qφ(z | s) − log p(z)
Algorithm
Algorithm 1: DIAYN
while not converged do
    Sample skill z ∼ p(z) and initial state s0 ∼ p0(s)
    for t ← 1 to steps_per_episode do
        Sample action at ∼ πθ(at | st, z) from the skill
        Step environment: st+1 ∼ p(st+1 | st, at)
        Compute qφ(z | st+1) with the discriminator
        Set skill reward rt = log qφ(z | st+1) − log p(z)
        Update policy (θ) to maximize rt with SAC
        Update discriminator (φ) with SGD
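A minimal PyTorch-style sketch of the two DIAYN-specific pieces, the discriminator qφ(z | s) and the pseudo-reward, assuming discrete skills and a uniform p(z); the names (Discriminator, skill_reward, discriminator_loss) are illustrative, and the SAC update itself is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which skill produced the observed state."""
    def __init__(self, state_dim: int, n_skills: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_skills),  # logits over skills
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def skill_reward(disc: Discriminator, state: torch.Tensor,
                 z: torch.Tensor, n_skills: int) -> torch.Tensor:
    # r_t = log q_phi(z | s_{t+1}) - log p(z); with p(z) uniform, log p(z) = -log(n_skills)
    log_q = F.log_softmax(disc(state), dim=-1).gather(-1, z.unsqueeze(-1)).squeeze(-1)
    return log_q + torch.log(torch.tensor(float(n_skills)))

def discriminator_loss(disc: Discriminator, states: torch.Tensor,
                       z: torch.Tensor) -> torch.Tensor:
    # cross-entropy: maximize log q_phi(z | s) on states visited while executing skill z
    return F.cross_entropy(disc(states), z)
```

The policy would then be trained with SAC (as in the algorithm above) to maximize this pseudo-reward, while the discriminator is trained on the states each skill actually visits.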
Applications
- adapting skills to maximize a reward
- hierarchical RL
- imitation learning
- unsupervised meta RL
Table of Contents
- Preliminary Knowledge
- An Unsupervised RL Algorithm: Diversity is All You Need
- Unsupervised Meta-Learning for Reinforcement Learning
Motivation
- aim to do so without depending on any human supervision or information about the tasks that will be provided for meta-testing
- assumptions of prior work ✗
- a fixed task distribution
- meta-train and meta-test tasks are sampled from this distribution
- Why not a pre-specified task distribution?
- specifying a task distribution is tedious and requires a significant amount of supervision
- the performance of meta-learning algorithms critically depends on the meta-training task distribution, and meta-learning algorithms generalize best to new tasks drawn from the same distribution as the meta-training tasks
- assumption of this work: the environment dynamics (CMP) remain the same
- an "environment-specific learning procedure"
Attention
- this paper has been rejected (possibly twice)
- this paper makes some very strong assumptions in its analysis:
- deterministic dynamics (listed as "future work" in 2018, but the authors seem to have forgotten it...)
- reward is only given at the end state (two such cases are considered)
- the experiments may not be sufficient or convincing
- there are some things that are wrong (or at least ambiguous) in the paper...
Definition of Terminology and Symbols
- MDP: M = (S, A, P, γ, ρ, r)
- CMP: C = (S, A, P, γ, ρ)
- S: state space
- A: action space
- P: transition dynamics
- γ: discount factor
- ρ: initial state distribution
- dataset of experience (for an MDP): D = {(si, ai, ri, s′i)} ∼ M
- learning algorithm (for an MDP): f : D → π
Definition of Terminology and Symbols (cont.)
- for a CMP: R(f, rz) = Σi E_{π = f({τ1, ..., τi−1}), τ∼π}[Σt rz(st, at)]
- evaluate the learning procedure f by summing its cumulative reward across iterations
Key Idea
- from the perspective of the "no free lunch theorem": the assumption that the dynamics remain the same across tasks affords us an inductive bias with which we pay for our lunch
- our results are lower bounds for the performance of general learning procedures
Regret for a Certain Task Distribution (given CMP)
- For a task distribution p(rz), the optimal learning procedure f∗ is given by
  f∗ ≜ arg max_f Ep(rz)[R(f, rz)]
- regret of a learning procedure under a task distribution:
  REGRET(f, p(rz)) ≜ Ep(rz)[R(f∗, rz)] − Ep(rz)[R(f, rz)]
- Obviously,
  f∗ ≜ arg min_f REGRET(f, p(rz)) and REGRET(f∗, p(rz)) = 0
- f∗ would be the output of a traditional meta-RL algorithm
Regret for the Worst-Case Task Distribution (given CMP)
- evaluate a learning procedure f based on its regret against the worst-case task distribution for CMP C: REGRETWC(f, C) = max_{p(rz)} REGRET(f, p(rz))
- this way, we do not need any prior knowledge of p(rz)
- Note: the CMP itself still provides an inductive bias
Optimal Unsupervised Learning Procedure
Definition
The optimal unsupervised learning procedure f∗_C for a CMP C is defined as
f∗_C ≜ arg min_f REGRETWC(f, C).
- "unsupervised" means no reward signal is needed (as in DIAYN)
- f∗_C would be the output of our unsupervised meta-RL algorithm
Optimal Unsupervised Meta-learner
Definition
The optimal unsupervised meta-learner F∗, with F∗(C) = f∗_C, is a function that takes as input a CMP C and outputs the corresponding optimal unsupervised learning procedure f∗_C:
F∗ ≜ arg min_F REGRETWC(F(C), C)
- the optimal unsupervised meta-learner F∗ is universal: it does not depend on any particular task distribution, or on any particular CMP
Min-Max
min_f max_p REGRET(f, p)
Analysis by Case Study
- Special Case: Goal-Reaching Tasks
- General Case: Trajectory-Matching Tasks
- in these cases, we make some assumptions, such as deterministic dynamics, and then generalize
Special Case: Goal-Reaching Tasks
- consider episodes with finite horizon T and a discount factor of γ = 1
- reward: rg(st) ≜ 1(t = T) · 1(st = g)
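Read concretely, the agent is rewarded only if the episode's final state equals the goal. A tiny sketch (numpy; the function name is illustrative):

```python
import numpy as np

def goal_reward(state: np.ndarray, t: int, goal: np.ndarray, horizon: int) -> float:
    # r_g(s_t) = 1(t = T) * 1(s_t = g)
    return float(t == horizon and np.array_equal(state, goal))
```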
Optimal Learning Procedure for Known p(sg)
- Define fπ as the learning procedure that uses policy π to explore until the goal is found, and then always returns to the goal state (f is a learning procedure, something like SAC or PPO...)
- the goal of meta-RL (for known p(sg)): find the best exploration policy π
- ρ^T_π(s): the probability that policy π visits state s at time step t = T
- expected hitting time of this goal state: HITTINGTIMEπ(sg) = 1 / ρ^T_π(sg)
- tip: the "hitting time" is the expected number of episodes needed for the end state to be the goal state when exploring with the given policy π; since each episode ends at sg with probability ρ^T_π(sg), this number is geometric with mean 1 / ρ^T_π(sg)
Optimal Learning Procedure for Known p(sg) (cont.)
- definition of regret:
  REGRET(f, p(rz)) ≜ Ep(rz)[R(f∗, rz)] − Ep(rz)[R(f, rz)]
- regret of the learning procedure fπ:
  REGRET(fπ, p(rg)) = ∫ HITTINGTIMEπ(sg) p(sg) dsg = ∫ p(sg) / ρ^T_π(sg) dsg
- the exploration policy of the optimal meta-learner, π∗, satisfies (derivation sketched below):
  ρ^T_π∗(sg) = √p(sg) / ∫ √p(s′g) ds′g
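Where the square root comes from (a hedged reconstruction of the step the slide skips): minimize ∫ p(sg) / ρ(sg) dsg over distributions ρ = ρ^T_π with ∫ ρ(sg) dsg = 1. The Cauchy–Schwarz inequality gives

```latex
\left(\int \sqrt{p(s_g)}\,\mathrm{d}s_g\right)^{2}
  = \left(\int \sqrt{\frac{p(s_g)}{\rho(s_g)}}\,\sqrt{\rho(s_g)}\,\mathrm{d}s_g\right)^{2}
  \le \int \frac{p(s_g)}{\rho(s_g)}\,\mathrm{d}s_g \;\cdot\; \int \rho(s_g)\,\mathrm{d}s_g
  = \mathrm{Regret}(f_\pi, p(r_g)),
```

with equality iff p(sg)/ρ(sg) ∝ ρ(sg), i.e. ρ(sg) ∝ √p(sg), which after normalization is exactly the expression above.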
Optimal Learning Procedure for Unknown p(sg)
Lemma
Let π be a policy for which ρ^T_π(s) is uniform. Then fπ has the lowest worst-case regret among learning procedures in Fπ. (The proof is straightforward, by contradiction.)
- finding such a policy π is challenging, especially in high-dimensional state spaces and in the absence of resets
- so we acquire fπ directly, without ever computing π
Optimal Learning Procedure for Unknown p(sg) (cont.)
- what we want: ρ^T_π(s) should be a uniform distribution
- how: define a latent variable z, couple z to the terminal state sT, and sample z from a uniform distribution
- there exists a conditional distribution µ(sT | z) (more detail later); change it to maximize the mutual information: max_{µ(sT|z)} Iµ(sT; z)
- we still need to make sure that maximizing the mutual information actually makes sT uniform
Optimal Learning Procedure for Unknown p(sg) (cont.)
Lemma
Assume there exists a conditional distribution µ(sT | z) satisfying the following two properties:
- 1. The marginal distribution over terminal states is uniform: µ(sT) = ∫ µ(sT | z)µ(z)dz = Unif(S); and
- 2. The conditional distribution µ(sT | z) is a Dirac: ∀z, sT ∃sz s.t. µ(sT | z) = 1(sT = sz).
Then any solution µ(sT | z) to the mutual information objective satisfies µ(sT) = Unif(S) and µ(sT | z) = 1(sT = sz).
Optimal Learning Procedure for Unknown p(sg) (cont.)
- how do we get µ(sT | z)?
- define a latent-conditioned policy µ(a | s, z)
- then we have µ(τ, z) = µ(z) p(s1) Πt p(st+1 | st, at) µ(at | st, z)
- get the marginal likelihood by integrating the trajectory over everything except sT:
  µ(sT, z) = ∫ µ(τ, z) ds1 da1 · · · daT−1
- divide by µ(z) (which is a uniform distribution): µ(sT | z) = µ(sT, z) / µ(z)
- then set rz(sT, aT) ≜ log p(sT | z)
Optimal Learning Procedure for Unknown p(sg) (cont.)
- what is wrong with this?
  Iµ(sT; z) = H[ST] − H[ST | Z] = E_{z∼p(z), sT∼µ(sT|z)}[log µ(sT | z) − log µ(sT)]
  but... how do we get log µ(sT)?
- instead, use:
  Iµ(sT; z) = H[Z] − H[Z | ST] = E_{z∼p(z), sT∼µ(sT|z)}[log µ(z | sT) − log µ(z)]
  log µ(z | sT) is also difficult to obtain (because we do not have µ(sT)), but we can learn an approximation of µ(z | sT) directly, just like DIAYN (a small sketch follows below)
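A minimal sketch of this tractable direction, assuming we already have samples of (z, sT) pairs and a learned classifier over skills (as in DIAYN); `log_q_z_given_sT` would come from such a classifier, and `mi_lower_bound` is an illustrative name:

```python
import numpy as np

def mi_lower_bound(log_q_z_given_sT: np.ndarray, z: np.ndarray, n_skills: int) -> float:
    """Sample-based estimate of E[log q(z | s_T)] - E[log p(z)], a variational
    lower bound on I(s_T; z), assuming p(z) is uniform over n_skills."""
    # pick, for each sample, the log-probability the classifier assigns to the skill actually used
    log_q = log_q_z_given_sT[np.arange(len(z)), z]
    return float(log_q.mean() + np.log(n_skills))
```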
General Case: Trajectory-Matching Tasks
- "trajectory-matching" tasks: only provide a positive reward when the policy executes the optimal trajectory: r∗τ(τ) ≜ 1(τ = τ∗)
- the trajectory-matching case is actually a generalization of the typical reinforcement learning case with Markovian rewards
- hitting time and regret (for known p(τ∗)):
  HITTINGTIMEπ(τ∗) = 1 / π(τ∗)
  REGRET(fπ, p(rτ)) = ∫ HITTINGTIMEπ(τ) p(τ) dτ = ∫ p(τ) / π(τ) dτ
General Case: Trajectory-Matching Tasks (cont.)
for unknown p(τ∗), we again have a lemma:
Lemma
Let π be a policy for which π(τ) is uniform. Then fπ has the lowest worst-case regret among learning procedures in Fπ.
and we maximize the objective just as before: I(τ; z) = H[τ] − H[τ | z]
General Reward-Maximizing Tasks
- trajectory matching is a super-set of the problem of optimizing any possible Markovian reward function at test time
- bounding the worst-case regret on Rτ minimizes an upper bound on the worst-case regret on Rs,a:
  min_{rτ ∈ Rτ} Eπ[rτ(τ)] ≤ min_{r ∈ Rs,a} Eπ[Σt r(st, at)]
- (the bound is quite loose; does it really work?)
Algorithm
Performance
Unsupervised meta-learning accelerates learning
Performance (cont.)
Comparison with handcrafted tasks
Discussion: Can we be even more "unsupervised"?
- The paper emphasizes that the algorithm assumes a given CMP; that is, it makes no requirement on the reward mechanism, but it requires all tasks to share the same CMP.
- Can we simply drop the "fixed CMP" constraint? ✗
- Could we use another meta-RL method, such as PEARL, to obtain a context describing the CMP, and then do unsupervised meta-RL based on that context?
Discussion: Can we be a bit more "supervised"?
- The paper keeps stressing that designing a task distribution is difficult, and tries to give up designing one entirely, obtaining prior knowledge directly from the CMP. But this completely abandons the possibility of injecting expert knowledge.
- Is there a better way to combine expert knowledge and environment dynamics?
- In goal-reaching tasks, if the rewards for reaching different goal states differ, the exploration policy that satisfies the min-max objective is no longer uniform; it depends on the final reward.
Discussion: Combining the two points above, can we explicitly use an adversarial strategy to strike a balance between unsupervised meta-RL and supervised meta-RL?
- One way to see it: the essence of unsupervised meta-RL is that, given some property (here, the CMP), an adversarial argument yields a learning procedure that performs well enough even in the worst case.
- The paper's analysis suggests that the adversarial idea is embodied in the assumption that every state appears with equal frequency.
- Could we combine this with the earlier discussion and use an explicit adversary with weaker assumptions, thereby introducing expert knowledge?
Discussion: On stochastic dynamics
- the "future work" forgotten by the authors
- likewise, a context-based representation of the dynamics could be used
- in fact, the current method can be applied directly to stochastic dynamics, but more theoretical justification is needed