Deep Learning Hyperparameter Optimization with Competing Objectives
GTC 2018 - S8136
Scott Clark, scott@sigopt.com
OUTLINE
- 1. Why is Tuning Models Hard?
- 2. Common Tuning Methods
- 3. Deep Learning Example
- 4. Tuning Multiple Metrics
- 5. Multi-metric Optimization Examples
Deep Learning / AI is extremely powerful.
Tuning these systems is extremely non-intuitive.
Photo: Joe Ross
TUNABLE PARAMETERS IN DEEP LEARNING
Photo: Tammy Strobel
STANDARD METHODS FOR HYPERPARAMETER SEARCH
STANDARD TUNING METHODS
Diagram: parameter configurations (weights, thresholds, window sizes, transformations) are proposed by grid search, random search, or manual search, then evaluated by training and cross-validating the ML / AI model on training and testing data.
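A minimal sketch of how these two baselines enumerate configurations; the parameter names and values below are illustrative, not the talk's actual search space.

import itertools
import random

# Illustrative search space; names and ranges are assumptions, not the talk's.
space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout":       [0.1, 0.3, 0.5],
    "hidden_units":  [64, 128, 256],
}

# Grid search: every combination, so cost multiplies with each added parameter.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]

# Random search: a fixed budget of independent draws from the same space.
random_configs = [{name: random.choice(values) for name, values in space.items()}
                  for _ in range(10)]

print(len(grid), "grid points vs.", len(random_configs), "random points")

Grid search covers the space exhaustively but blows up combinatorially; random search keeps the budget fixed but ignores past results, which is the gap the feedback loop on the next slide addresses.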
OPTIMIZATION FEEDBACK LOOP
Diagram: the ML / AI model is trained and cross-validated on training and testing data; its objective metric is reported back through the REST API, which returns new configurations, driving better results.
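A minimal sketch of the loop above, following the suggestion/observation pattern of the SigOpt Python client as used in public examples from this period; exact argument names may differ by client version, and the parameters, bounds, and budget are illustrative. train_and_evaluate() is a placeholder for your own cross-validated training run.

from sigopt import Connection

def train_and_evaluate(assignments):
    # Placeholder: train the model with these hyperparameters and return the
    # cross-validated objective metric (e.g. validation accuracy).
    raise NotImplementedError

conn = Connection(client_token="YOUR_API_TOKEN")

experiment = conn.experiments().create(
    name="Hyperparameter tuning example",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-1)),
        dict(name="dropout", type="double", bounds=dict(min=0.0, max=0.9)),
    ],
    observation_budget=60,
)

for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()  # new configuration
    value = train_and_evaluate(suggestion.assignments)                   # objective metric
    conn.experiments(experiment.id).observations().create(               # report the result
        suggestion=suggestion.id,
        value=value,
    )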
DEEP LEARNING EXAMPLE
- Classify movie reviews using a CNN in MXNet
SIGOPT + MXNET
https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuning-with-aws-gpu-instances-and-sigopt/
TEXT CLASSIFICATION PIPELINE
Diagram: an MXNet model is trained on training text and validated on testing text; validation accuracy is reported through the REST API, which returns new hyperparameter configurations and feature transformations for better results.
- Comparison of several RMSProp SGD parametrizations
STOCHASTIC GRADIENT DESCENT
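The RMSProp variant of SGD compared here has its own knobs. A plain-NumPy sketch of the update rule (not the talk's MXNet code) makes the tunable quantities explicit:

import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update: scale each step by a running average of squared gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2   # moving average of grad^2
    w = w - lr * grad / (np.sqrt(cache) + eps)          # per-parameter adaptive step size
    return w, cache

# lr, decay, and eps (plus batch size and epoch count outside this function) are
# exactly the non-intuitive hyperparameters compared across parametrizations.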
ARCHITECTURE PARAMETERS
MULTIPLICATIVE TUNING SPEED UP
SPEED UP #1: CPU -> GPU
SPEED UP #2: RANDOM/GRID -> SIGOPT
CONSISTENTLY BETTER AND FASTER
TUNING MULTIPLE METRICS
What if we want to optimize multiple competing metrics?
- Complexity Tradeoffs
○ Accuracy vs Training Time
○ Accuracy vs Inference Time
- Business Metrics
○ Fraud Accuracy vs Money Lost
○ Conversion Rate vs LTV
○ Engagement vs Profit
○ Profit vs Drawdown
PARETO OPTIMAL
What does it mean to optimize two metrics simultaneously?
Pareto efficiency, or Pareto optimality, is a state of allocation of resources from which it is impossible to reallocate so as to make any one individual or preference criterion better off without making at least one individual or preference criterion worse off.
PARETO OPTIMAL
What does it mean to optimize two metrics simultaneously?
The red points are on the Pareto Efficient Frontier; they strictly dominate all of the grey points. You can do no better in one metric without sacrificing performance in the other. Point N is Pareto Optimal compared to Point K.
PARETO EFFICIENT FRONTIER
The goal is to have the best set of feasible solutions to select from.
After optimization, the expert picks one or more of the red points from the Pareto Efficient Frontier to further study or put into production.
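A short sketch of the dominance check behind this picture; each point is a (metric_1, metric_2) pair with larger values better for both, so a metric like training time is negated to fit the convention.

def pareto_frontier(points):
    """Return the points that no other point strictly dominates."""
    def dominates(q, p):
        # q dominates p if it is at least as good in both metrics and better in one.
        return q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (accuracy, -training_time) pairs; the dominated (0.84, -50) point drops out.
configs = [(0.90, -120), (0.85, -40), (0.84, -50), (0.92, -300), (0.88, -45)]
print(pareto_frontier(configs))
# [(0.9, -120), (0.85, -40), (0.92, -300), (0.88, -45)]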
TOY EXAMPLE
MULTI-METRIC OPTIMIZATION
DEEP LEARNING EXAMPLES
MULTI-METRIC OPT IN DEEP LEARNING
https://devblogs.nvidia.com/sigopt-deep-learning-hyperparameter-optimization/
DEEP LEARNING TRADEOFFS
- Deep Learning pipelines are time-consuming and expensive to run
- Application and deployment conditions may make certain configurations less desirable
- Tuning for both accuracy and complexity metrics like training or inference time allows the expert to make the best decision for production
- Comparison of several RMSProp SGD parametrizations
- Different configurations converge differently
STOCHASTIC GRADIENT DESCENT
TEXT CLASSIFICATION PIPELINE
Diagram: the same MXNet text classification pipeline, now reporting both validation accuracy and training time through the REST API in exchange for new hyperparameter configurations and feature transformations.
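Continuing the earlier feedback-loop sketch (same conn and suggestion objects), the two-metric version of this experiment could be declared and reported roughly as below. The metrics/values field names follow SigOpt's public multimetric examples; the per-metric objective field is an assumption that may differ by API version (older clients may instead require negating any metric you want to minimize).

# Experiment definition with two competing metrics instead of one.
experiment = conn.experiments().create(
    name="CNN text classifier: accuracy vs. training time",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-1)),
        dict(name="dropout", type="double", bounds=dict(min=0.0, max=0.9)),
    ],
    metrics=[
        dict(name="accuracy", objective="maximize"),
        dict(name="training_time", objective="minimize"),
    ],
    observation_budget=120,
)

# Each observation reports both values; the API tracks an approximate
# Pareto frontier rather than a single best point.
conn.experiments(experiment.id).observations().create(
    suggestion=suggestion.id,
    values=[
        dict(name="accuracy", value=0.89),        # example numbers, not the talk's results
        dict(name="training_time", value=340.0),  # seconds
    ],
)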
FINDING THE FRONTIER
SEQUENCE CLASSIFICATION PIPELINE
Diagram: a TensorFlow model is trained on training sequences and validated on testing sequences; validation accuracy and inference time are reported through the REST API, which returns new hyperparameter configurations and feature transformations.
TEXT CLASSIFICATION PIPELINE
FINDING THE FRONTIER
LOAN CLASSIFICATION PIPELINE
Diagram: a LightGBM model is trained and validated on loan data; validation AUCPR and average dollars lost are reported through the REST API, which returns new hyperparameter configurations and feature transformations.
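A hedged sketch of how one configuration in this pipeline might be scored on both metrics; the deck does not spell out how "Avg $ Lost" is computed, so the missed-fraud formula below is purely illustrative.

import numpy as np
import lightgbm as lgb
from sklearn.metrics import average_precision_score

def evaluate(params, X_train, y_train, X_val, y_val, amounts_val):
    """Score one hyperparameter configuration on AUCPR and average dollars lost."""
    model = lgb.LGBMClassifier(**params)             # e.g. num_leaves, learning_rate, n_estimators
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]

    aucpr = average_precision_score(y_val, scores)   # metric 1: maximize
    missed = (scores < 0.5) & (np.asarray(y_val) == 1)          # fraud the model failed to flag
    avg_dollars_lost = float(np.asarray(amounts_val)[missed].sum()) / len(y_val)  # metric 2: minimize
    return aucpr, avg_dollars_lost

Feeding both numbers back as a multi-metric observation (as in the earlier sketch) produces the frontier that the grid search comparison below is judged against.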
GRID SEARCH CAN MISLEAD
- The best grid search point (w.r.t. accuracy) loses >$35 / transaction
- The best grid search point (w.r.t. loss) has 70% accuracy
- Points on the Pareto Frontier give the user more information about what is possible and more control over trade-offs
DISTRIBUTED TRAINING/SCHEDULING
- SigOpt serves as a distributed scheduler for training models across workers
- Workers access the SigOpt API for the latest parameters to try for each model
- Enables easy distributed training of non-distributed algorithms across any number of models, as sketched below
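A minimal sketch of the scheduling pattern just described: several independent worker processes, each running the same suggest/train/report loop against one experiment. EXPERIMENT_ID and train_and_evaluate() are placeholders, and the client calls follow the same assumed API as the earlier sketches.

from multiprocessing import Process
from sigopt import Connection

def train_and_evaluate(assignments):
    # Placeholder: one full (non-distributed) training run with these hyperparameters.
    raise NotImplementedError

def worker(experiment_id, budget_per_worker):
    # Each worker opens its own connection; workers never coordinate with each other,
    # only with the API, which decides which configuration each one tries next.
    conn = Connection(client_token="YOUR_API_TOKEN")
    for _ in range(budget_per_worker):
        suggestion = conn.experiments(experiment_id).suggestions().create()
        value = train_and_evaluate(suggestion.assignments)
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id, value=value,
        )

if __name__ == "__main__":
    EXPERIMENT_ID = "YOUR_EXPERIMENT_ID"
    procs = [Process(target=worker, args=(EXPERIMENT_ID, 15)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()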
TAKEAWAYS
One metric may not paint the whole picture
- Think about metric trade-offs in your model pipelines
- Optimizing for the wrong thing can be very expensive
Not all optimization strategies are equal
- Pick an optimization strategy that gives the most flexibility
- Different tools enable you to tackle new problems