AutoSys: The Design and Operation of Learning-Augmented Systems - PowerPoint PPT Presentation

AutoSys: The Design and Operation of Learning-Augmented Systems
Chieh-Jan Mike Liang, Hui Xue, Mao Yang, Lidong Zhou, Lifei Zhu, Zhao Lucis Li, Zibo Wang, Qi Chen, Quanlu Zhang, Chuanjie Liu, Wenjun Dai
Microsoft Research, Peking University, USTC, Bing Platform, Bing Ads
USENIX ATC '20

Learning-Augmented Systems
• Systems whose design methodology or control logic is at the intersection of traditional heuristics and machine learning
• Not a stranger to academic communities: “Workshop on ML for Systems”, “MLSys Conference”, …
• This work reports our years of experience in designing and operating learning-augmented systems in production
  1. AutoSys framework
  2. Long-term operation lessons

Our Scope in This Paper: Auto-tuning System Config Parameters
• The problem is simple…
  • A great application of black-box optimization
  • Find the configuration that best optimizes the performance counters
[Figure: a system (software, storage, hardware, network) with configuration parameters as inputs and performance counters as outputs]
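
To make the black-box framing concrete, here is a minimal sketch using random search, one of the tuner types listed later in the deck. The parameter names, their ranges, and the `run_benchmark` cost function are hypothetical stand-ins, not anything from AutoSys; a real trial would apply the configuration to the target system and read its performance counters.

```python
import random

# Hypothetical search space: parameter name -> (low, high) integer range.
SEARCH_SPACE = {
    "cache_size_mb": (64, 1024),
    "num_threads": (1, 32),
}


def run_benchmark(config):
    """Stand-in for a real system benchmark; returns a cost (lower is better)."""
    return (abs(config["cache_size_mb"] - 512) / 512
            + abs(config["num_threads"] - 8) / 8)


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.randint(low, high)
            for name, (low, high) in SEARCH_SPACE.items()}


def random_search(num_trials, seed=0):
    """Treat the system as a black box: sample, benchmark, keep the best."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        cost = run_benchmark(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


best_config, best_cost = random_search(num_trials=200)
print(best_config, best_cost)
```

The optimizer never looks inside `run_benchmark`; it only sees configurations going in and a cost coming out, which is what makes the problem simple to state.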

Our Scope in This Paper: Auto-tuning System Config Parameters
• But the problem is very difficult for system operators in practice…
  • Vast system-specific parameter search space
  • Continual optimization based on system-specific triggers
[Figure: a system (software, storage, hardware, network) with configuration parameters as inputs and performance counters as outputs]

Our Scope in This Paper: Bing Web Search
[Figure: Bing web search pipeline, from search query to search results. Selection Service servers run selection engines (keyword-based and semantics-based), Ranking Service servers run ranking engines (ML/DL models), and Re-ranking Service servers run re-ranking engines. A KV cluster of key-value store engines (RocksDB, MLTF) holds the inverted index, vectorized index, and per-document forward index.]

Our Scope in This Paper: Bing Web Search
• Auto-tuning Selection engines to optimally select relevant documents
[Figure: Bing web search pipeline, repeated from the previous slide]

Our Scope in This Paper: Bing Web Search
• Auto-tuning Ranking models to optimally rank documents
[Figure: Bing web search pipeline, repeated from the previous slide]

Our Scope in This Paper: Bing Web Search
• Auto-tuning key-value stores to reduce lookup latency
[Figure: Bing web search pipeline, repeated from the previous slide]

Towards A Unified Framework - AutoSys
• Addressing common pain points in building learning-augmented systems
  • Job scheduling and prioritization for sequential optimization approaches
  • Handling learning-induced system failures (due to ML inference uncertainty)
  • Generality and extensibility
• Lowering the cost of bootstrapping new scenarios, by sharing data and models
  • System deployments typically contain replicated service instances
  • Different system deployments can contain the same service
• Facilitating computation resource sharing
  • Difficult to provision job resources
  • Jobs in AutoSys are ad-hoc and nondeterministic

Jobs Within AutoSys
Types | Descriptions | Examples
Tuners | Executes (1) ML/DL model training and inferencing, and (2) optimization solver | Hyperband, TPE, SMAC, Metis, random search, …
Trials | Executes system explorations | RocksDB, …
• AutoSys jobs are ad-hoc:
  • Jobs are triggered in response to system and workload dynamics
• AutoSys jobs are nondeterministic:
  • Jobs are spawned as necessary, according to optimization progress at runtime
  • Job completion time depends on system benchmarks and runtime (e.g., cache warmup)
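
The tuner/trial split and the nondeterministic job count can be sketched as follows. This is an illustration under stated assumptions, not the AutoSys scheduler: the `Tuner` and `Trial` classes, the `block_size_kb` parameter, the patience-based stopping rule, and the toy `run_trial` benchmark are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trial:
    """One system exploration: benchmark a single configuration."""
    config: dict
    result: Optional[float] = None


@dataclass
class Tuner:
    """Proposes configurations and decides at runtime whether to spawn more trials."""
    patience: int = 5  # stop after this many trials in a row without improvement
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    best: float = float("inf")
    stale: int = 0

    def propose(self) -> Trial:
        # Hypothetical single-parameter search space.
        return Trial(config={"block_size_kb": self.rng.choice([4, 8, 16, 32, 64])})

    def observe(self, trial: Trial) -> None:
        if trial.result < self.best:
            self.best, self.stale = trial.result, 0
        else:
            self.stale += 1

    def wants_more(self) -> bool:
        return self.stale < self.patience


def run_trial(trial: Trial) -> Trial:
    # Toy benchmark that pretends 16 KB blocks are ideal (lower cost is better).
    trial.result = abs(trial.config["block_size_kb"] - 16)
    return trial


def optimize(tuner: Tuner, max_trials: int = 100) -> list:
    """Spawn trials one at a time until the tuner stops asking for more."""
    history = []
    while tuner.wants_more() and len(history) < max_trials:
        trial = run_trial(tuner.propose())
        tuner.observe(trial)
        history.append(trial)
    return history


history = optimize(Tuner())
print(len(history), "trials spawned; best cost:", min(t.result for t in history))
```

The number of trials is not known when the job starts; it depends on how the optimization progresses, which is one reason resources for such jobs are hard to provision in advance.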

Overview
[Figure: AutoSys architecture. A Training Plane (Trial Manager, Model Trainer, Candidate Generator, Model Repository) serves Target System #1 and Target System #2. Each target system exposes a Control Interface and has its own Inference Plane (Rule Engine, Inference Runtime).]

Overview – Learning
[Figure: AutoSys architecture, repeated from the Overview slide]

Overview – Learning
1.) By assessing current model progress, AutoSys generates benchmark candidates to iteratively improve the model
• Exploration: benchmarks that are of high uncertainty
• Exploitation: benchmarks that are likely to be optimal
• Re-sampling: benchmarks that likely contain measurement noise or outliers
[Figure: AutoSys architecture, repeated from the Overview slide]
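
The three candidate types can be illustrated with a deliberately simple, model-free sketch. It assumes a one-dimensional parameter, uses distance from existing observations as a crude proxy for model uncertainty, and uses the spread of repeated measurements as a proxy for noise; AutoSys itself derives these from its learned model, and the function and variable names here are hypothetical.

```python
import statistics


def generate_candidates(observations, grid):
    """observations: {x: [measurements]}, lower is better; grid: candidate x values."""
    means = {x: statistics.mean(ys) for x, ys in observations.items()}
    unobserved = [x for x in grid if x not in observations]

    # Exploitation: the unobserved point closest to the best configuration so far.
    best_x = min(means, key=means.get)
    exploit = min(unobserved, key=lambda x: abs(x - best_x))

    # Exploration: the unobserved point farthest from every observation
    # (distance stands in for high model uncertainty).
    explore = max(unobserved, key=lambda x: min(abs(x - o) for o in observations))

    # Re-sampling: the observed point whose repeated measurements disagree most.
    spreads = {x: statistics.pstdev(ys)
               for x, ys in observations.items() if len(ys) > 1}
    resample = max(spreads, key=spreads.get) if spreads else None

    return {"exploit": exploit, "explore": explore, "resample": resample}


observations = {
    2.0: [10.1, 10.3],
    4.0: [7.9, 8.0],
    6.0: [9.0, 14.5],  # noisy: the two measurements disagree
}
grid = [float(x) for x in range(0, 21)]
candidates = generate_candidates(observations, grid)
print(candidates)
```

With this toy data the sketch proposes a point next to the current best (x = 4.0), a point at the far end of the unexplored range, and a repeat of the noisy measurement at x = 6.0.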

Overview – Learning
2.) AutoSys prioritizes benchmark candidates according to how likely they are to help discover the optimum in the search space
• E.g., its Metis tuner uses a Gaussian process to estimate the information gain
• E.g., its TPE tuner uses two Gaussian mixture models (GMMs) to estimate the likelihood of a candidate being the optimum
[Figure: AutoSys architecture, repeated from the Overview slide]
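
The TPE-style idea of scoring a candidate with two densities can be sketched as below. This is a simplified stand-in, not the AutoSys or NNI implementation: each density is an equal-weight mixture of Gaussians centered on observed points with a fixed bandwidth, the parameter is one-dimensional, and `good_fraction` and `bandwidth` are hypothetical knobs. The Gaussian-process approach used by Metis is not shown.

```python
import math


def mixture_density(x, centers, bandwidth):
    """Equal-weight mixture of Gaussians centered on `centers`, evaluated at x."""
    if not centers:
        return 0.0
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / len(centers)


def rank_candidates(history, candidates, good_fraction=0.25, bandwidth=1.0):
    """history: list of (x, cost), lower cost is better.

    Returns the candidates sorted from most to least promising.
    """
    ordered = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(len(ordered) * good_fraction))
    good = [x for x, _ in ordered[:n_good]]   # best-performing observations
    bad = [x for x, _ in ordered[n_good:]]    # everything else

    def score(x):
        # Density under the "good" mixture relative to the "bad" mixture;
        # the small epsilon avoids division by zero.
        return (mixture_density(x, good, bandwidth)
                / (mixture_density(x, bad, bandwidth) + 1e-12))

    return sorted(candidates, key=score, reverse=True)


# Toy history: cost is lowest near x = 7.
history = [(x, (x - 7.0) ** 2) for x in [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]]
ranking = rank_candidates(history, candidates=[1.0, 5.0, 7.0, 11.0])
print(ranking)
```

Candidates that sit where good observations are dense and poor ones are sparse rise to the top, so the candidate nearest the low-cost region is benchmarked first.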

Overview – Auto-Tuning Actuations
[Figure: AutoSys architecture, repeated from the Overview slide]

Overview – Auto-Tuning Actuations
3.) As it is difficult to formally verify ML/DL correctness, AutoSys opts to validate ML/DL outputs with a rule-based engine.
• Useful for validating parameter value constraints and dependencies
• Useful for preventing known bad configurations from being applied
• Useful for implementing triggers based on the system’s actuation feedback
[Figure: AutoSys architecture, repeated from the Overview slide]
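
A rule-based check of this kind might look like the sketch below. It covers the first two uses only (value constraints and dependencies, and a blocklist of known bad configurations); the parameter names, limits, and the `Rule`, `validate`, and `actuate` helpers are hypothetical and do not reflect the AutoSys rule engine's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the configuration is acceptable


# Hypothetical blocklist of configurations known to have caused problems.
KNOWN_BAD = [{"num_threads": 64, "cache_size_mb": 64}]

RULES = [
    # Value constraints on individual parameters.
    Rule("threads in range", lambda c: 1 <= c["num_threads"] <= 64),
    Rule("cache in range", lambda c: 64 <= c["cache_size_mb"] <= 4096),
    # Dependency between parameters: the write buffer must fit inside the cache.
    Rule("buffer fits in cache", lambda c: c["write_buffer_mb"] <= c["cache_size_mb"]),
    # Blocklist: reject any configuration that matches a known bad pattern.
    Rule("not known bad", lambda c: not any(
        all(c.get(k) == v for k, v in bad.items()) for bad in KNOWN_BAD)),
]


def validate(config):
    """Return the names of all violated rules; an empty list means the config passes."""
    return [rule.name for rule in RULES if not rule.check(config)]


def actuate(proposed, current):
    """Apply the model's proposal only if it passes every rule; otherwise keep `current`."""
    return proposed if not validate(proposed) else current


current = {"num_threads": 8, "cache_size_mb": 512, "write_buffer_mb": 64}
proposed = {"num_threads": 64, "cache_size_mb": 64, "write_buffer_mb": 128}
print(validate(proposed))
print(actuate(proposed, current))
```

Because the rules are plain predicates over the configuration, operators can add or remove them without retraining or inspecting the model whose output they guard.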

Summary of Production Deployments
Target system | Tuning time | Key results (vs. long-term expert tuning)
Keyword-based Selection Engine (KSE) | 1 week | Up to 33.5% and 11.5% reduction in 99-percentile latency and CPU utilization, respectively
Semantics-based Selection Engine (SSE) | 1 week | Up to 20.0% reduction in average latency
Ranking Engine (RE) | 1 week | 3.4% improvement in NDCG@5
RocksDB key-value cluster (RocksDB) | 2 days | Lookup latency on par with years of expert tuning
Multi-level Time and Frequency-value cluster (MLTF) | 1 week | 16.8% reduction on average in 99-percentile latency

Long-term Lessons Learned
Higher-than-expected learning costs
• Various types of system dynamics can frequently trigger re-training
  • System deployments can scale up/down over time
  • Workloads can drift over time
• Learning large-scale system deployments can be costly
  • Testbeds might not match the scale and fidelity of the production environment
  • It is typically infeasible to explore system behavior in the production environment
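
One simple way a drift-based re-training trigger could work is sketched below, assuming a single monitored metric. The `DriftTrigger` class, its sliding window, and the relative-change threshold are hypothetical illustrations; the slides do not describe how AutoSys detects drift.

```python
from collections import deque


class DriftTrigger:
    """Signals re-training when a sliding-window mean drifts away from a baseline."""

    def __init__(self, baseline, window=5, threshold=0.2):
        self.baseline = baseline            # metric value when the model was trained
        self.recent = deque(maxlen=window)  # most recent measurements
        self.threshold = threshold          # allowed relative change, e.g. 0.2 = 20%

    def observe(self, value):
        """Record one measurement; return True when re-training should be triggered."""
        self.recent.append(value)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        mean = sum(self.recent) / len(self.recent)
        return abs(mean - self.baseline) / self.baseline > self.threshold


trigger = DriftTrigger(baseline=100.0, window=3, threshold=0.2)
fired = [trigger.observe(v) for v in [100.0, 105.0, 95.0, 150.0, 160.0]]
print(fired)
```

A lower threshold reacts to drift sooner but fires more often, and each firing incurs the re-training cost this slide warns about.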

Long-term Lessons Learned
Pitfalls of human-in-the-loop
• Human experts can inject biases into training datasets
  • E.g., human experts can provide labeled data points for certain search space regions
• Human errors can prevent AutoSys from functioning correctly
  • E.g., wrong parameter value ranges

Long-term Lessons Learned
System control interfaces should abstract system measurements and logs to facilitate learning
• Many systems distribute configuration parameters and error messages over a set of poorly documented files and logs
• Much system feedback is not natively learnable, e.g., stack traces and core dumps
• Some systems require customized measurement aggregation and cleaning
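
As an illustration of measurement aggregation and cleaning, the sketch below turns raw latency samples into a small numeric record that a tuner could learn from. The warm-up handling, the validity filter, and the `summarize_latencies` helper are assumptions made for this example, not part of any AutoSys control interface.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize_latencies(samples_ms, warmup=0):
    """Drop warm-up samples and invalid readings, then aggregate the rest."""
    cleaned = [s for s in samples_ms[warmup:] if s is not None and s >= 0]
    if not cleaned:
        return None  # nothing usable; the caller should re-run the benchmark
    return {
        "count": len(cleaned),
        "avg_ms": sum(cleaned) / len(cleaned),
        "p99_ms": percentile(cleaned, 99),
    }


# Two slow warm-up samples, one missing reading, and one invalid negative value.
raw = [250.0, 180.0, 12.0, 11.5, None, 13.0, -1.0, 12.5]
summary = summarize_latencies(raw, warmup=2)
print(summary)
```

Hiding this cleanup behind the control interface means every tuner sees the same fixed set of numeric fields, regardless of how each system logs its raw measurements.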

Conclusion
• This work reports our years of experience in designing and operating learning-augmented systems in production
  1. AutoSys framework, for unifying the development at Microsoft
  2. Long-term operation lessons
• Core components of AutoSys are publicly available at https://github.com/Microsoft/nni

Mike Liang
Systems and Networking Research Group, Microsoft Research Asia
liang.mike@microsoft.com
www.microsoft.com/en-us/research/people/cmliang
