  1. Assessing Generalization in Deep Reinforcement Learning
  Soo Jung Jang

  2. Background
  ● Before (e.g., a factory robot): focus on one environment; generalization is not considered
  ● Now (e.g., human-like intelligence): apply to multiple environments; generalization is important
  ● Paper's Goal: an empirical study of generalization in deep RL across different (1) algorithms, (2) environments, and (3) metrics

  3. Algorithms
  ● Vanilla (baseline) algorithms
    ○ A2C: actor-critic family
    ○ PPO: policy-gradient family
  ● Generalization-tackling algorithms
    ○ EPOpt: robust approach
    ○ RL2: adaptation approach
  ● 6 algorithms in total: A2C, PPO, EPOpt-A2C, EPOpt-PPO, RL2-A2C, RL2-PPO

  4. Algorithms - Vanilla
  ● A2C / actor-critic family
    ○ Critic: learns a value function
    ○ Actor: uses that value function to learn a policy that maximizes expected reward
  ● PPO / policy-gradient family
    ○ Learns a sequence of improving policies
    ○ Maximizes a surrogate for the expected reward via gradient ascent (see the sketch below)
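The slide names PPO's surrogate objective but does not spell it out. As a point of reference, here is a minimal sketch of the standard PPO clipped surrogate loss in PyTorch; the function name and tensor arguments (`logp_new`, `logp_old`, `advantages`) are illustrative, not the paper's code:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) from log-probabilities.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate; negate it to get a loss to minimize.
    return -torch.min(unclipped, clipped).mean()
```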

  5. Algorithms - Generalization-Tackling
  ● EPOpt / robust approach
    ○ Maximizes the expected reward over the subset of environments with the lowest expected reward, i.e. maximizes the conditional value at risk (CVaR); see the selection sketch below
  ● RL2 / adaptation approach
    ○ Learns an environment embedding at test time, "on the fly"
    ○ An RNN takes the current trajectory as input; its hidden states serve as the embedding
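A minimal sketch of the EPOpt-style CVaR selection step described above, assuming rollouts from many sampled environment instances are already collected; `trajectories`, `returns`, and `epsilon` are illustrative names, and the paper's own implementation details may differ:

```python
import numpy as np

def epopt_select_worst(trajectories, returns, epsilon=0.1):
    # Keep only the epsilon fraction of rollouts with the lowest return;
    # the base algorithm (A2C or PPO) is then updated on this worst subset,
    # which approximately maximizes the conditional value at risk (CVaR).
    returns = np.asarray(returns)
    k = max(1, int(np.ceil(epsilon * len(returns))))
    worst = np.argsort(returns)[:k]
    return [trajectories[i] for i in worst]
```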

  6. Algorithms - Network Architecture
  ● Feed-forward (FF)
    ○ Multi-layer perceptron (MLP)
  ● Recurrent (RC)
    ○ LSTM on top of an MLP
  ● The 4 non-RL2 algorithms are tested with both FF and RC; the 2 RL2 algorithms are tested only with RC (a sketch of both architectures follows)
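A rough PyTorch sketch of the two architecture families; the hidden sizes and activations here are placeholders, not the paper's hyperparameters:

```python
import torch.nn as nn

class FFPolicy(nn.Module):
    # Feed-forward (FF): a plain MLP from observation to action logits.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)


class RCPolicy(nn.Module):
    # Recurrent (RC): an LSTM stacked on an MLP encoder; the hidden state
    # carries information across time steps within an episode.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq, state=None):
        z = self.encoder(obs_seq)          # (batch, time, hidden)
        out, state = self.lstm(z, state)
        return self.head(out), state
```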

  7. Environments
  ● 6 environments (OpenAI Gym): CartPole, MountainCar, Acrobot, Pendulum, HalfCheetah, Hopper

  8. Metrics - Environment Parameters
  ● Deterministic (D)
    ○ fixed at the default value (fixed environment)
  ● Random (R)
    ○ uniformly sampled from a d-dimensional box (feasible environments)
  ● Extreme (E)
    ○ uniformly sampled from the union of 2 intervals that straddle the corresponding interval in R (edge cases)
  ● Schematic: d = 2 with 4 samples per scheme (a sampling sketch follows)
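A sketch of how one environment-parameter vector could be drawn under each scheme; the interval bounds are placeholder arguments, since the paper's per-environment ranges are not given on this slide:

```python
import numpy as np

def sample_env_params(mode, default, r_low, r_high, e_low, e_high,
                      rng=np.random):
    # D: fixed default; R: uniform in the box [r_low, r_high];
    # E: uniform over [e_low, r_low] U [r_high, e_high], i.e. the two
    # intervals straddling the R interval, chosen per dimension.
    default = np.asarray(default, dtype=float)
    if mode == "D":
        return default.copy()
    if mode == "R":
        return rng.uniform(r_low, r_high, size=default.shape)
    if mode == "E":
        lower = rng.uniform(e_low, r_low, size=default.shape)
        upper = rng.uniform(r_high, e_high, size=default.shape)
        pick_lower = rng.random(default.shape) < 0.5
        return np.where(pick_lower, lower, upper)
    raise ValueError(f"unknown mode: {mode}")
```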

  9. Metrics - Evaluation
  ● 3 evaluation metrics, drawn from the 3x3 train-test pairs of (D/R/E):
    1. Default: DD
    2. Interpolation: RR
    3. Extrapolation: mean of DR, DE, RE (see the aggregation sketch below)
  ● Metric value (performance)
    ○ Success rate: the percentage of test episodes in which a specified goal is completed
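A small sketch of how the 3x3 grid of success rates collapses into the three reported metrics; the dictionary keys and percent units are assumptions for illustration:

```python
def generalization_metrics(success):
    # `success` maps (train_scheme, test_scheme) pairs to success rates,
    # e.g. success[("D", "R")] = 62.0 (percent).
    default = success[("D", "D")]
    interpolation = success[("R", "R")]
    extrapolation = (success[("D", "R")]
                     + success[("D", "E")]
                     + success[("R", "E")]) / 3.0
    return {"Default": default,
            "Interpolation": interpolation,
            "Extrapolation": extrapolation}
```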

  10. Experiment
  ● Compare performance across:
    ○ 10 algorithm-architecture combinations (6 algorithms / 2 architectures)
    ○ 6 environments
    ○ 3 metrics (default, interpolation, extrapolation)
  ● Methodology: train for 15,000 episodes, test on 1,000 episodes (an evaluation sketch follows)
  ● Fairness:
    ○ no memory carried over from previous episodes
    ○ several hyperparameter sweeps
    ○ success rate instead of raw reward
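A sketch of the test-time success-rate measurement, assuming the classic OpenAI Gym step API and a user-supplied, per-environment success predicate; the paper defines its own goal condition for each environment, so `policy` and `is_success` are hypothetical callables:

```python
import gym  # classic Gym API (4-tuple step) assumed

def evaluate_success_rate(policy, env_id, is_success, n_episodes=1000):
    # Roll out the trained policy for n_episodes and report the percentage
    # of episodes whose total return satisfies the success predicate.
    env = gym.make(env_id)
    successes = 0
    for _ in range(n_episodes):
        total_reward, obs, done = 0.0, env.reset(), False
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total_reward += reward
        successes += int(is_success(total_reward))
    return 100.0 * successes / n_episodes
```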

  11. Results
  ● Default > Interpolation > Extrapolation
  ● FF > RC architecture
  ● Vanilla > generalization-tackling algorithms
  ● RL2 variants do not work well
  ● EPOpt-PPO works well in continuous action spaces (Pendulum, HalfCheetah, Hopper)

  12. Discussion Questions
  ● The generalization-tackling algorithms tested in this paper failed. What would be a potential strategy that makes generalization work? How would you solve this RL generalization problem?
  ● Why do you think generalization-tackling algorithms and recurrent (RC) architectures perform worse than vanilla algorithms and feed-forward (FF) architectures? When would you expect generalization-tackling algorithms and recurrent (RC) architectures to work better?
  ● Do you think the paper's experiment methodology is fair? Is there a better way to evaluate generalization across different algorithms and architectures?
