  1. CS885 Reinforcement Learning Module 3: July 5, 2020. Imitation Learning. References: Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957); Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573). University of Waterloo, CS885 Spring 2020, Pascal Poupart.

  2. Imitation Learning • Behavioural cloning (supervised learning) • Generative adversarial imitation learning (GAIL) • Imitation learning from observations • Inverse reinforcement learning

  3. Motivation • Learn from expert demonstrations – No reward function needed – Faster learning • Example domains: autonomous driving, chatbots, robotics

  4. Behavioural Cloning • Simplest form of imitation learning • Assumption: state-action pairs are observable • Observe trajectories: (s_1, a_1), (s_2, a_2), (s_3, a_3), ..., (s_n, a_n) • Create a training set mapping states to actions: S → A • Train by supervised learning – classification (discrete actions) or regression (continuous actions)
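Since behavioural cloning reduces to ordinary supervised learning, a few lines of code capture it. Below is a minimal sketch in PyTorch, not from the slides: the network shape, dimensions, and the randomly generated "demonstrations" are illustrative stand-ins for real expert data.

    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 3                      # assumed toy dimensions
    policy = nn.Sequential(                          # pi(a|s) as a small classifier
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()                  # maximizes likelihood of expert actions

    # Stand-in demonstration data playing the role of expert pairs (s_i, a_i).
    states = torch.randn(256, state_dim)
    actions = torch.randint(0, n_actions, (256,))

    for epoch in range(100):
        logits = policy(states)                      # unnormalized log pi(a|s)
        loss = loss_fn(logits, actions)              # -log pi(a_expert|s), averaged
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

With continuous actions, the same loop applies with a regression head and a mean-squared-error (or Gaussian log-likelihood) loss in place of cross-entropy.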

  5. Case Study I: Autonomous Driving. Bojarski et al. (2016), End-to-end learning for self-driving cars. On-road tests: – Holmdel to Atlantic Highlands (NJ): autonomous ~98% of the time – Garden State Parkway (10 miles): no human intervention

  6. Case Study II: Conversational Agents (Sordoni et al., 2015). Encoder reads the state s (e.g., "How are you doing ?"); decoder emits the action a (e.g., "I am fine"). The response likelihood factors autoregressively: Pr(a | s) = ∏_i Pr(a_i | a_{i-1}, ..., a_1, s). Objective: choose the response maximizing this likelihood, max_a Pr(a | s).
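To make the factored likelihood concrete, here is a sketch of computing log Pr(a | s) with a toy encoder-decoder. The GRU architecture, vocabulary size, and token tensors are assumptions for illustration, not the model of Sordoni et al.

    import torch
    import torch.nn as nn

    vocab, emb, hid = 100, 32, 64                # assumed toy sizes
    embed = nn.Embedding(vocab, emb)
    encoder = nn.GRU(emb, hid, batch_first=True)
    decoder = nn.GRU(emb, hid, batch_first=True)
    out = nn.Linear(hid, vocab)

    src = torch.randint(0, vocab, (1, 5))        # stand-in for "How are you doing ?"
    tgt = torch.randint(0, vocab, (1, 4))        # stand-in for "<bos> I am fine"

    _, h = encoder(embed(src))                   # h encodes the state s
    dec_out, _ = decoder(embed(tgt[:, :-1]), h)  # condition each step on a_<i and s
    log_probs = torch.log_softmax(out(dec_out), dim=-1)
    # log Pr(a|s): sum the log-probability of each observed next token
    ll = log_probs.gather(-1, tgt[:, 1:].unsqueeze(-1)).sum()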

  7. Generative Adversarial Imitation Learning (GAIL) • Common approach: train the generator to maximize the likelihood of expert actions • Alternative: train the generator to fool a discriminator into believing that the generated actions come from the expert – Leverages GANs (generative adversarial networks) – Ho & Ermon, 2016

  8. Generative Adversarial Networks (GANs). A generator g maps a random vector z to a fake sample; a discriminator d_w classifies a sample y as real or fake. (Figure examples: StyleGAN2, Karras et al., 2020; CelebA, Liu et al., 2015.) Objective:
     min_g max_w Σ_i log Pr(y_i is real; w) + log Pr(g(z_i) is fake; w)
     = min_g max_w Σ_i log d_w(y_i) + log(1 − d_w(g(z_i)))
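The min-max objective is typically optimized by alternating gradient steps on the discriminator and the generator. A minimal sketch, assuming toy 2-D "real" data and small MLPs for g and d_w (all names and sizes illustrative):

    import torch
    import torch.nn as nn

    z_dim, x_dim = 8, 2
    g = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, x_dim))
    d = nn.Sequential(nn.Linear(x_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(g.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(d.parameters(), lr=2e-4)

    for step in range(1000):
        y = torch.randn(64, x_dim) * 0.5 + 1.0       # stand-in "real" data
        z = torch.randn(64, z_dim)

        # Inner max over w: ascend log d_w(y) + log(1 - d_w(g(z)))
        d_loss = -(torch.log(d(y) + 1e-8)
                   + torch.log(1 - d(g(z).detach()) + 1e-8)).mean()
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Outer min over g: descend log(1 - d_w(g(z)))
        g_loss = torch.log(1 - d(g(z)) + 1e-8).mean()
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

The small epsilon inside each log guards against log(0); detaching g(z) in the discriminator step keeps that update from propagating into the generator.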

  9. GAIL Pseudocode
     Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_0, a_0, s_1, a_1, ...)
     Initialize parameters θ of policy π_θ and parameters w of discriminator d_w
     Repeat until stopping criterion:
       Update discriminator parameters:
         δ_w = Σ_{(s,a) ∈ τ_E} ∇_w log d_w(s, a) + Σ_{(s,a) ∼ π_θ} ∇_w log(1 − d_w(s, a))
         w ← w + α_w δ_w
       Update policy parameters with TRPO:
         Cost(s_t, a_t) = Σ_{(s,a) | s_t, a_t, π_θ} log(1 − d_w(s, a))
         δ_θ = Σ_{(s,a) ∼ π_θ} ∇_θ log π_θ(a | s) Cost(s, a) − λ ∇_θ H(π_θ)
         θ ← θ − α_θ δ_θ
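A compact sketch of one iteration of this pseudocode follows, with two deliberate simplifications: a plain REINFORCE-style gradient step stands in for TRPO, and environment interaction is replaced by stand-in batches. It shows the structure of the two updates rather than a faithful implementation of Ho & Ermon.

    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 3                      # assumed toy dimensions
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    disc = nn.Sequential(nn.Linear(state_dim + n_actions, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Sigmoid())
    opt_w = torch.optim.Adam(disc.parameters(), lr=1e-3)
    opt_theta = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def d_in(s, a):
        # Discriminator input: state concatenated with a one-hot action.
        return torch.cat([s, nn.functional.one_hot(a, n_actions).float()], dim=-1)

    # Stand-in batches for expert pairs (s,a) from tau_E and policy pairs from pi_theta.
    s_E = torch.randn(128, state_dim); a_E = torch.randint(0, n_actions, (128,))
    s_pi = torch.randn(128, state_dim)
    dist = torch.distributions.Categorical(logits=policy(s_pi))
    a_pi = dist.sample()

    # Discriminator ascent on log d_w(s,a) over expert data + log(1 - d_w(s,a)) over policy data.
    d_loss = -(torch.log(disc(d_in(s_E, a_E)) + 1e-8).mean()
               + torch.log(1 - disc(d_in(s_pi, a_pi)) + 1e-8).mean())
    opt_w.zero_grad(); d_loss.backward(); opt_w.step()

    # Policy descent on the cost log(1 - d_w(s,a)) with an entropy bonus (lambda = 0.01).
    cost = torch.log(1 - disc(d_in(s_pi, a_pi)) + 1e-8).squeeze(-1).detach()
    pg_loss = (dist.log_prob(a_pi) * cost).mean() - 0.01 * dist.entropy().mean()
    opt_theta.zero_grad(); pg_loss.backward(); opt_theta.step()

The cost is detached so the policy gradient flows only through log π_θ(a | s), matching the δ_θ expression above.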

  10. Robotics Experiments • Robot imitating an expert policy with GAIL (Ho & Ermon, 2016)

  11. Imitation Learning from Observations • Consider imitation learning from a human expert (Schaal et al., 2003) • Actions (e.g., forces) unobservable • Only states/observations (e.g., joint positions) observable • Problem: infer actions from state/observation sequences

  12. Inverse Dynamics. Two steps:
     1. Learn inverse dynamics: learn Pr(a | s, s′) by supervised learning – from (s, a, s′) samples obtained by executing random actions
     2. Behavioural cloning: learn π(â | s) by supervised learning – from (s, s′) samples taken from expert trajectories, with â ∼ Pr(a | s, s′) sampled from the inverse dynamics model
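Step 1 is again plain supervised learning. A minimal sketch of fitting Pr(a | s, s′) from random-action triples, with assumed toy dimensions and synthetic data standing in for real experience:

    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 3
    inv_dyn = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(),
                            nn.Linear(64, n_actions))   # logits over actions
    opt = torch.optim.Adam(inv_dyn.parameters(), lr=1e-3)

    s = torch.randn(512, state_dim)                     # states visited
    a = torch.randint(0, n_actions, (512,))             # random actions executed
    s_next = torch.randn(512, state_dim)                # resulting next states

    for _ in range(200):
        logits = inv_dyn(torch.cat([s, s_next], dim=-1))
        loss = nn.functional.cross_entropy(logits, a)   # maximize log Pr(a | s, s')
        opt.zero_grad(); loss.backward(); opt.step()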

  13. Pseudocode: Imitation Learning from Observations
     Input: expert trajectories τ_E ∼ π_expert, where τ_E = (s_0, s_1, s_2, ...)
     Initialize agent policy π_θ at random
     Repeat:
       Learn inverse dynamics model with parameters w:
         Sample triples (s_t, a_t, s_{t+1}) by executing π_θ
         w ← argmax_w Σ_t log Pr_w(a_t | s_t, s_{t+1})
       Learn policy parameters θ:
         For each (s_t, s_{t+1}) from expert trajectories τ_E:
           â_t ∼ Pr_w(a_t | s_t, s_{t+1})
         θ ← argmax_θ Σ_t log π_θ(â_t | s_t)
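Putting both steps into the loop above gives a BCO-style sketch. The environment transition here is a synthetic stand-in, and all architectures and sizes are illustrative assumptions rather than the setup of Torabi et al.

    import torch
    import torch.nn as nn

    state_dim, n_actions = 4, 3
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    inv_dyn = nn.Sequential(nn.Linear(2 * state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
    opt_w = torch.optim.Adam(inv_dyn.parameters(), lr=1e-3)

    # Expert state-only trajectory, flattened into (s_t, s_{t+1}) pairs (stand-in data).
    s_E = torch.randn(256, state_dim)
    s_E_next = torch.randn(256, state_dim)

    for iteration in range(10):
        # (i) Execute pi_theta to collect (s, a, s') triples; a toy random
        # transition stands in for a real environment step.
        s = torch.randn(256, state_dim)
        a = torch.distributions.Categorical(logits=policy(s)).sample()
        s_next = s + 0.1 * torch.randn_like(s)
        logits = inv_dyn(torch.cat([s, s_next], dim=-1))
        loss_w = nn.functional.cross_entropy(logits, a)   # argmax_w Σ log Pr_w(a|s,s')
        opt_w.zero_grad(); loss_w.backward(); opt_w.step()

        # (ii) Infer expert actions a_hat ~ Pr_w(a | s, s') and clone them.
        with torch.no_grad():
            a_hat = torch.distributions.Categorical(
                logits=inv_dyn(torch.cat([s_E, s_E_next], dim=-1))).sample()
        loss_pi = nn.functional.cross_entropy(policy(s_E), a_hat)  # argmax_θ Σ log pi(a_hat|s)
        opt_pi.zero_grad(); loss_pi.backward(); opt_pi.step()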

  14. Robotics Experiments (Torabi et al., 2018)
