Assessing Generalization in Deep Reinforcement Learning, Soo Jung Jang - PowerPoint Presentation



SLIDE 1

Assessing Generalization in Deep Reinforcement Learning

Soo Jung Jang

SLIDE 2

Background

Before (ex: factory robot): focus on one environment; generalization is not considered
Now (ex: human-like intelligence): apply to multiple environments; generalization is important

  • Paper’s Goal: Empirical study of generalization in deep RL

with different (1) algorithms, (2) environments, and (3) metrics

SLIDE 3

Algorithms

  • Vanilla (Baseline) Algorithms

○ A2C: Actor-Critic Family
○ PPO: Policy-Gradient Family

  • Generalization-Tackling Algorithms

○ EPOpt: Robust Approach
○ RL2: Adapt Approach

  • 6 Algorithms Total

A2C, PPO, EPOpt-A2C, EPOpt-PPO, RL2-A2C, RL2-PPO

SLIDE 4

Algorithms - Vanilla

  • A2C / Actor-Critic Family

○ Critic: learns a value function
○ Actor: uses that value function to learn a policy that maximizes expected reward

  • PPO / Policy-Gradient Family

○ Learn a sequence of improving policies
○ Maximize a surrogate for the expected reward via gradient ascent
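To make the actor-critic split concrete, here is a minimal sketch of the two A2C loss terms. This is illustrative only, not the paper's implementation; the function names, the toy inputs, and the absence of entropy regularization are all simplifying assumptions.

```python
# Sketch of the two A2C loss terms (hypothetical names; not the paper's code).

def discounted_returns(rewards, gamma=0.99):
    """Compute discounted returns G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def a2c_losses(rewards, values, log_probs, gamma=0.99):
    """Critic: regress value estimates toward returns.
    Actor: weight log-probabilities by the advantage (return - value)."""
    returns = discounted_returns(rewards, gamma)
    advantages = [g - v for g, v in zip(returns, values)]
    critic_loss = sum(a * a for a in advantages) / len(advantages)
    # Policy gradient maximizes sum(log_pi * advantage), so we minimize its negation.
    actor_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / len(advantages)
    return actor_loss, critic_loss
```

The critic's value estimates enter the actor's loss only through the advantage, which is the "uses the value function" step described above.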

SLIDE 5

Algorithms - Generalization-Tackling

  • EPOpt / Robust Approach

○ Maximize expected reward over the subset of environments with the lowest expected reward (i.e., maximize the conditional value at risk, CVaR)

  • RL2 / Adapt Approach

○ Learn an environment embedding at test time, “on the fly”
○ An RNN takes the current trajectory as input; its hidden states serve as the embedding
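The EPOpt objective above can be sketched in a few lines: average the return only over the worst-performing fraction of sampled environments. This is a simplified illustration of the CVaR idea, not the paper's training loop; `epsilon` and the function name are assumptions.

```python
# Hypothetical sketch of the EPOpt-style CVaR objective over sampled environments.

def epopt_objective(env_returns, epsilon=0.1):
    """Mean return over the worst epsilon-fraction of environments.

    env_returns: one expected-return estimate per sampled environment.
    epsilon: fraction of worst environments to keep (0 < epsilon <= 1).
    """
    k = max(1, int(len(env_returns) * epsilon))  # at least one environment
    worst = sorted(env_returns)[:k]              # lowest-return environments
    return sum(worst) / k
```

With epsilon = 1 this reduces to the ordinary expected reward, which is why EPOpt interpolates between average-case and worst-case training.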

SLIDE 6

Algorithms - Network Architecture

  • Feed Forward (FF)

○ Multi-layer perceptron (MLP)

  • Recurrent (RC)

○ LSTM on top of MLP

  • 4 Non-RL2 Algorithms → Test on both FF and RC
  • 2 RL2 Algorithms → Test only on RC
SLIDE 7

Environments

  • 6 Environments (OpenAI)

CartPole MountainCar Acrobot Pendulum HalfCheetah Hopper

SLIDE 8

Metrics - Environment Parameters

  • Deterministic (D)

○ fixed at default value (fixed environment)

  • Random (R)

○ uniformly sampled from a d-dimensional box (feasible environments)

  • Extreme (E)

○ uniformly sampled from the union of two intervals that straddle the corresponding interval in R (edge cases)

Schematic (d=2 and 4 samples)
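For one parameter dimension, the three sampling regimes can be sketched as below. This is a hedged illustration of the D/R/E scheme described above, not the paper's code; the interval arguments and function name are assumptions.

```python
import random

def sample_env_param(mode, default, r_low, r_high, e_low, e_high):
    """Sample one environment parameter under the D/R/E regimes.

    D: fixed at the default value.
    R: uniform over the feasible interval [r_low, r_high].
    E: uniform over [e_low, r_low] U [r_high, e_high], the two
       intervals straddling R (edge cases).
    """
    if mode == "D":
        return default
    if mode == "R":
        return random.uniform(r_low, r_high)
    if mode == "E":
        # Pick one of the two straddling intervals, then sample uniformly in it.
        lo, hi = random.choice([(e_low, r_low), (r_high, e_high)])
        return random.uniform(lo, hi)
    raise ValueError(f"unknown mode: {mode}")
```

Sampling each of the d parameters independently this way gives the d-dimensional box (R) and its surrounding edge region (E).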

SLIDE 9

Metrics - Evaluation

  • 3 Evaluation Metrics

From the 3×3 train-test pairs of (D/R/E):
1. Default: DD
2. Interpolation: RR
3. Extrapolation: mean of DR, DE, and RE

  • Metric Value (Performance)

○ Success Rate: percentage (%) of episodes in which a certain goal is completed
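The three metrics above reduce to simple lookups in the 3×3 train/test grid of success rates; a minimal sketch (the dict layout and function name are assumptions, not the paper's code):

```python
# Hypothetical helper: success[train][test] = success rate (%) for a
# (train regime, test regime) pair, with regimes "D", "R", "E".

def generalization_metrics(success):
    """Compute the paper's three evaluation metrics from the grid."""
    return {
        "default": success["D"]["D"],
        "interpolation": success["R"]["R"],
        "extrapolation": (success["D"]["R"]
                          + success["D"]["E"]
                          + success["R"]["E"]) / 3.0,
    }
```

Extrapolation averages the three pairs where the test regime is strictly harder than the training regime, which is why it is typically the lowest of the three numbers.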

SLIDE 10

Experiment

  • Compare performance of:

○ 10 algorithm-architecture combinations (4 non-RL2 algorithms × {FF, RC}, plus 2 RL2 algorithms × RC)
○ 6 environments
○ 3 metrics (default, interpolation, extrapolation)

  • Methodology

Train 15000 episodes / Test 1000 episodes

  • Fairness

No memory of previous episodes
Several sweeps of hyperparameters
Success rate instead of the reward itself

SLIDE 11

Results

  • Default > Interpolation > Extrapolation
  • FF > RC architecture
  • Vanilla > Generalization-Tackling
  • RL2 variants do not work
  • EPOpt-PPO works well in continuous action spaces (Pendulum, HalfCheetah, Hopper)

SLIDE 12

Discussion Questions

  • Generalization-tackling algorithms tested in this paper failed.

What would be a potential strategy that makes generalization work? How would you solve this RL generalization problem?

  • Why do you think generalization-tackling algorithms and recurrent (RC)

architectures perform worse than the vanilla algorithms and feed-forward (FF) architectures? When would you expect generalization-tackling algorithms and recurrent (RC) architectures to work better?

  • Do you think the paper’s experiment methodology is fair? Is there a better

way to evaluate generalization on different algorithms and architectures?