Assessing Generalization in Deep Reinforcement Learning
Soo Jung Jang
Background
- Before (ex: factory robot): focus on one environment; generalization is not considered
- Now (ex: human-like intelligence): apply to multiple environments; generalization is important
- Paper’s Goal: Empirical study of generalization in deep RL
with different (1) algorithms, (2) environments, and (3) metrics
Algorithms
- Vanilla (Baseline) Algorithms
○ A2C: Actor-Critic Family
○ PPO: Policy-Gradient Family
- Generalization-Tackling Algorithms
○ EPOpt: Robust Approach
○ RL2: Adapt Approach
- 6 Algorithms Total
A2C, PPO, EPOpt-A2C, EPOpt-PPO, RL2-A2C, RL2-PPO
Algorithms - Vanilla
- A2C / Actor-Critic Family
○ Critic: learns a value function
○ Actor: uses the function to learn a policy that maximizes expected reward
- PPO / Policy-Gradient Family
○ Learn a sequence of improving policies
○ Maximize a surrogate for the expected reward via gradient ascent (see sketch below)
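As a concrete illustration of the surrogate objective mentioned above, here is a minimal NumPy sketch of the standard PPO clipped surrogate; the function name, batch layout, and the clip value of 0.2 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ppo_clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies; advantages: estimated advantages; all are 1-D
    arrays over a batch of timesteps.
    """
    ratio = np.exp(logp_new - logp_old)                       # probability ratio r_t
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # clamp large updates
    # Elementwise minimum keeps the update pessimistic about large policy
    # changes; the mean over the batch approximates the expectation.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```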
Algorithms - Generalization-Tackling
- EPOpt / Robust Approach
○ Maximize expected reward over subset of environments with lowest expected reward (Maximize conditional value at risk)
- RL2 / Adapt Approach
○ Learn an environment embedding at test time, “on-the-fly”
○ RNN with the current trajectory as input; hidden states serve as the embedding
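The "maximize conditional value at risk" idea behind EPOpt above can be sketched as a filtering step over rollouts collected under randomly sampled environment parameters; the function name and the epsilon value below are illustrative, not the paper's implementation.

```python
import numpy as np

def epopt_filter(trajectories, returns, epsilon=0.1):
    """EPOpt-style CVaR filtering (sketch).

    Keep only the epsilon-fraction of rollouts with the lowest returns, so
    that the subsequent policy update (A2C or PPO) optimizes performance on
    the worst-case environments rather than the average one.
    """
    returns = np.asarray(returns, dtype=float)
    cutoff = np.quantile(returns, epsilon)        # epsilon-quantile of returns
    worst = np.where(returns <= cutoff)[0]        # indices of the hardest rollouts
    return [trajectories[i] for i in worst]
```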
Algorithms - Network Architecture
- Feed Forward (FF)
○ Multi-layer perceptron (MLP)
- Recurrent (RC)
○ LSTM on top of MLP
- 4 Non-RL2 Algorithms → Test on both FF and RC
- 2 RL2 Algorithms → Test only on RC
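A rough PyTorch sketch of the two architectures above, assuming a small MLP and an LSTM stacked on top of it; the layer sizes, activations, and the use of PyTorch itself are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

class FeedForwardPolicy(nn.Module):
    """FF architecture: a plain multi-layer perceptron over the observation."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class RecurrentPolicy(nn.Module):
    """RC architecture: an LSTM stacked on top of the MLP feature extractor."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim); the LSTM state carries episode memory
        feats = self.mlp(obs_seq)
        out, state = self.lstm(feats, state)
        return self.head(out), state
```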
Environments
- 6 Environments (OpenAI)
CartPole, MountainCar, Acrobot, Pendulum, HalfCheetah, Hopper
Metrics - Environment Parameters
- Deterministic (D)
○ fixed at default value (fixed environment)
- Random (R)
○ uniformly sampled from d-dimensional box (feasible environment)
- Extreme (E)
○ uniformly sampled from union of 2 intervals that straddle corresponding interval in R (edge cases)
Schematic of the D/R/E parameter ranges (d = 2, 4 samples)
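A sketch of how the three sampling modes could be implemented for a d-dimensional environment parameter vector; the function signature and the per-dimension 50/50 choice between the lower and upper extreme interval are assumptions, and the actual ranges are environment-specific.

```python
import numpy as np

def sample_env_params(mode, default, low, high, ext_low, ext_high, rng=None):
    """Sample a d-dimensional environment parameter vector (sketch).

    D: return the default values (fixed environment).
    R: sample uniformly from the box [low, high] (feasible environment).
    E: per dimension, sample uniformly from one of the two intervals
       [ext_low, low] or [high, ext_high] that straddle the R interval.
    """
    rng = rng if rng is not None else np.random.default_rng()
    default, low, high = map(np.asarray, (default, low, high))
    ext_low, ext_high = np.asarray(ext_low), np.asarray(ext_high)
    if mode == "D":
        return default.copy()
    if mode == "R":
        return rng.uniform(low, high)
    if mode == "E":
        below = rng.uniform(ext_low, low)              # lower extreme interval
        above = rng.uniform(high, ext_high)            # upper extreme interval
        take_lower = rng.random(default.shape) < 0.5   # pick a side per dimension
        return np.where(take_lower, below, above)
    raise ValueError(f"unknown mode {mode!r}")
```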
Metrics - Evaluation
- 3 Evaluation Metrics
From the 3x3 train-test pairs of (D/R/E):
1. Default: DD
2. Interpolation: RR
3. Extrapolation: mean of DR, DE, RE
- Metric Value (Performance)
○ Success rate: % of test episodes in which a certain goal is completed
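Given success rates measured for each of the 3x3 train-test pairs, the three reported metrics collapse as in the sketch below; the dictionary-based interface is an illustrative assumption.

```python
def summarize_metrics(success):
    """Collapse the 3x3 (train, test) success-rate table into three metrics.

    `success` maps pairs such as ('D', 'R') to the success rate (%) measured
    when training on D-sampled parameters and testing on R-sampled ones.
    """
    return {
        "default": success[("D", "D")],
        "interpolation": success[("R", "R")],
        "extrapolation": (success[("D", "R")]
                          + success[("D", "E")]
                          + success[("R", "E")]) / 3.0,
    }
```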
Experiment
- Compare performance of:
○ 10 algorithm-architecture combinations (6 algorithms x 2 architectures; RL2 variants on RC only)
○ 6 environments
○ 3 metrics (default, interpolation, extrapolation)
- Methodology
Train 15000 episodes / Test 1000 episodes
- Fairness
○ No memory of previous episode
○ Several sweeps of hyperparameters
○ Success rate instead of reward itself
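A minimal sketch of the test-time evaluation implied by the methodology above: run the frozen policy for the stated number of episodes, with no memory carried across episodes, and report the percentage of episodes that reach the goal. The `run_episode` callable is a hypothetical placeholder for one rollout of the trained agent.

```python
def evaluate_success_rate(run_episode, n_episodes=1000):
    """Return the success rate (%) over n_episodes independent test episodes.

    `run_episode` is assumed to execute one full episode with the frozen
    policy and return True if the environment's goal was completed.
    """
    successes = sum(bool(run_episode()) for _ in range(n_episodes))
    return 100.0 * successes / n_episodes
```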
Results
- Default > Interpolation > Extrapolation
- FF > RC Architecture
- Vanilla > Generalization-Tackling
- RL2 variants do not work
- EPOpt-PPO works well in continuous action spaces (Pendulum, HalfCheetah, Hopper)
Discussion Questions
- Generalization-tackling algorithms tested in this paper failed.
What would be a potential strategy that makes generalization work? How would you solve this RL generalization problem?
- Why do you think generalization-tackling algorithms and recurrent (RC)
architectures perform worse than Vanilla and feed-forward (FF)? When would you expect generalization-tackling algorithms and recurrent (RC) architectures to work better?
- Do you think the paper’s experiment methodology is fair? Is there a better
way to evaluate generalization on different algorithms and architectures?