Overviews and practical reports: Justin Clarke, Cecilia Ferrando, William Rebelsky (PowerPoint presentation)



SLIDE 1

Overviews and practical reports

Justin Clarke, Cecilia Ferrando, William Rebelsky

SLIDE 2

Trends

  • The growing amount of data and the need for Machine Learning data processing challenge us to advance systems
  • One trillion IoT devices expected by 2035
  • More need for supporting Machine Learning computation on edge devices
SLIDE 3

Specific challenges

  • Who should design better systems for Machine Learning?
  • What research is needed to address systems issues for Machine Learning?

  • How do we improve AI support on edge devices?
SLIDE 4

Why are these issues important today?

SLIDE 5

Source: https://mkomo.com/cost-per-gigabyte-update

SLIDE 6

SLIDE 7

SLIDE 8

Source: Forbes.com

SLIDE 9

Specific problems being addressed in these works

  • How do we develop better systems for ML?
  • Who should do so?
  • What needs to be done in order to improve the systems?
  • What research directions should we be taking?
  • How do we advance inference on edge devices?
SLIDE 10

Strategies and Principles of Distributed Machine Learning on Big Data

Eric P. Xing *, Qirong Ho, Pengtao Xie, Dai Wei

SLIDE 11

Overview

  • Rise of big Data -> more demands on Machine Learning
  • More demands on Machine Learning -> larger clusters necessary
  • Larger clusters necessary -> Someone needs to engineer the systems
  • Should this fall to Machine Learning Researchers or Systems researchers?
  • View: Big ML benefits from ML-rooted insights; therefore, ML researchers should do the system design

SLIDE 12

Four Key Questions

1. How can an ML program be distributed over a cluster?
2. How can ML computation be bridged with inter-machine communication?
3. How can such communication be performed?
4. What should be communicated between machines?

SLIDE 13

Background on Machine Learning

  • ML is becoming the primary mechanism for distilling raw data into usable insights
  • Most ML programs belong to one of a few families of well-developed approaches
  • Conventional ML R&D excels in model, algorithm, and theory development
  • In general, ML is an optimization problem:
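The objective itself did not survive extraction. As a sketch of the standard form used in this literature (symbols assumed here, not recovered from the slide: θ are the model parameters, 𝓛 the data-fitting objective over data 𝓓, and Ω a regularizer or prior):

```latex
\max_{\theta} \; \mathcal{L}(\mathcal{D}; \theta) + \Omega(\theta)
```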
SLIDE 14

Distributed Machine Learning

  • Given P systems, we would expect P-fold increase in performance
  • However, current state of the art shows less than ½P-fold increase in performance
  • Much of this is due to idealized assumptions in research:

○ Infinitely fast networks
○ All machines process at the same rate
○ No additional users/background tasks

  • Two obvious research avenues to improve performance:

○ Improve convergence rate (number of iterations)
○ Improve throughput (per-iteration time)

SLIDE 15

Machine Learning vs Traditional Programs

Machine Learning:

  • Error tolerance:

○ ML programs are robust to minor errors in intermediate steps

  • Dynamic structural dependencies:

○ Parameters depend not only on the data but on each other

  • Non-uniform convergence:

○ Not all parameters converge in the same number of iterations

Traditional:

  • Transaction-centric
  • Only executes correctly if each step is atomically correct

SLIDE 16

State of current platforms

  • Current platforms are general-purpose
  • For each piece of software, there is a tradeoff between:

○ Speed of execution
○ Ease of programmability
○ Correctness of solution

  • Current systems ensure that the outcome of each program is perfectly reproducible, which is not necessary in ML
  • This paper: usable software should instead offer two utilities:

○ A ready-to-run set of ML-workhorse implementations (e.g. MCMC)
○ An ML distributed-cluster OS that supports the above implementations

SLIDE 17

1. How can an ML program be distributed over a cluster?
  • ML programs can be parallelized using either model-parallelism or data-parallelism strategies
  • ML requires a mix of both
  • Improved efficiency:

○ Compute prioritization of parameters (e.g. give more resources to the parameters that need them)
○ Workload balancing using slow-worker agnosticism
○ Create a Structure-Aware Parallelization (SAP) for scheduling, prioritization, and load balancing
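As a minimal illustration of the data-parallel half of this mix (a toy least-squares example in NumPy; all names and sizes here are illustrative, not from the paper): sharding the rows across workers and summing the partial gradients recovers the full-data gradient exactly.

```python
import numpy as np

# Toy objective: least squares on (X, y); the gradient is X^T (X w - y).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 4)), rng.normal(size=8)
w = rng.normal(size=4)

# Data parallelism: each worker holds a shard of rows and computes a
# partial gradient; the partials are summed (an all-reduce in practice).
shards = np.array_split(np.arange(8), 2)
grad = sum(X[s].T @ (X[s] @ w - y[s]) for s in shards)

# The sharded gradient equals the gradient on the full data set.
full_grad = X.T @ (X @ w - y)
assert np.allclose(grad, full_grad)
```

Model parallelism would instead split the columns of `w` across workers, which is why scheduling (SAP) has to account for dependencies between parameter blocks.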

SLIDE 18

2. How can ML computation be bridged with inter-machine communication?

SLIDE 19

Bridging Models: Current state

  • Bulk Synchronous parallel bridging model (BSP):

○ Workers wait at the end of each iteration until everyone is finished
○ Issue: we don't get the P-fold speed-up
  ■ The synchronization barrier suffers from stragglers
  ■ The synchronization barrier can take longer than the iteration itself

  • Asynchronous execution:

○ Workers continue iterating and sending updates without waiting for others to finish
○ Issue: less progress per iteration
  ■ Information becomes stale
  ■ In the limit, errors can cause slow or incorrect convergence

SLIDE 20

Bridging Models: Solution

  • Combine the two models and get the best of both worlds
  • Stale Synchronous Parallel (SSP) bridging

○ Workers who get more than s iterations ahead of any other worker are stopped
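A minimal sketch of the SSP rule (the function and worker names are illustrative, not from the paper): each worker keeps an iteration clock and may proceed only while it is at most `s` iterations ahead of the slowest worker.

```python
# Minimal sketch of the Stale Synchronous Parallel (SSP) rule: a worker
# may proceed only while it is at most `s` iterations ahead of the
# slowest worker.
def may_proceed(worker_clock, all_clocks, s):
    """Return True if this worker is within the staleness bound."""
    return worker_clock - min(all_clocks) <= s

clocks = {"w0": 10, "w1": 7, "w2": 9}
# w0 is 3 iterations ahead of the slowest worker (w1).
print(may_proceed(clocks["w0"], clocks.values(), s=3))      # True: exactly at the bound
print(may_proceed(clocks["w0"] + 1, clocks.values(), s=3))  # False: would exceed the bound
```

With s = 0 this degenerates to BSP (everyone waits at a barrier); with s = ∞ it is fully asynchronous execution.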

SLIDE 21

3. How can such communication be performed?

  • Continuous communication with rate limiters in the SSP implementation
  • Wait-free Backpropagation

○ Take advantage of the fact that the top fully connected layers account for 90% of the parameters but only 10% of the backpropagation cost

  • Update prioritization based on parameters that change the most (absolute or relative)
  • Decentralized storage using Halton topology
SLIDE 22

4. What should be communicated between machines?

  • Typical clusters can transmit at most a few gigabytes per second between two machines

○ Naive synchronization is not instantaneous

  • Sufficient Factor Broadcasting:

○ Gradient computations can be decomposed to transmit S(K+D) instead of KD elements

  • Convergence is still guaranteed (although it can take extra iterations)
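A small NumPy sketch of the idea (sizes are illustrative, not from the paper): for matrix-parameterized models whose per-sample gradient is a rank-1 outer product u vᵀ, transmitting the factors costs S(K+D) elements instead of the full K×D gradient, and the receiver reconstructs the update exactly.

```python
import numpy as np

# Sufficient Factor Broadcasting sketch: for many matrix-parameterized
# models the per-sample gradient of the K x D weight matrix is a rank-1
# outer product u v^T, so each machine can send the factors (K + D
# numbers per sample) instead of the full K*D gradient matrix.
K, D, S = 1000, 500, 10          # illustrative sizes
rng = np.random.default_rng(0)
us = rng.normal(size=(S, K))     # "sufficient factors" for S samples
vs = rng.normal(size=(S, D))

# The receiver reconstructs the minibatch gradient from the factors.
grad = sum(np.outer(u, v) for u, v in zip(us, vs))

sent = S * (K + D)               # elements transmitted with SFB
naive = K * D                    # elements for the full gradient
print(sent, naive)               # 15000 vs 500000: far fewer elements
```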
SLIDE 23

Strategies and Principles of Distributed Machine Learning on Big Data

PROS:
1. Enough relevant background information to explain the issues to people not already in the field
2. Strong justification for ML researchers being involved in designing the systems
3. Separating the issues into 4 major questions allows for more directed research moving forward

CONS:
1. Section 4: Petuum
   a. Claims close to P-fold speedup but doesn't show data
   b. Basic implementation that "might become the foundation of an ML distributed cluster operating system"
2. Inconsistent specificity: the ML models are stated carefully, but the solutions only in general terms
   a. "Continuous communication can be achieved by a rate limiter in the SSP implementation"
   b. "SSP with properly selected staleness values"

SLIDE 24

A Berkeley View of Systems Challenges for AI

Ion Stoica, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph, Michael Jordan, Joseph M. Hellerstein, Joseph Gonzalez, Ken Goldberg, Ali Ghodsi, David Culler, Pieter Abbeel

SLIDE 25

Trends:

  • Mission critical AI
  • Personalized AI
  • AI across organizations
  • AI demands outpacing Moore’s Law
SLIDE 26

SLIDE 27

Acting in Dynamic Environments

  • Continual Learning
  • Robust Decisions
  • Explainable Decisions
slide-28
SLIDE 28

Secure AI

  • Secure enclaves
  • Adversarial learning
  • Shared learning on confidential data
slide-29
SLIDE 29

AI-specific architectures

  • Domain-specific hardware
  • Composable AI systems
  • Cloud-edge systems
SLIDE 30

Strengths and Weaknesses

SLIDE 31

[Cecilia]

SLIDE 32

Machine Learning at Facebook

  • Ranking posts
  • Content understanding
  • Object detection
  • Virtual reality
  • Speech recognition
  • Translation

[Diagram: machine learning tasks across datacenter infrastructure and the edge]

Inference on the edge to avoid latency

SLIDE 33

Challenges of edge inference

  • Edge hardware limitations (low performance)
  • Software limitations (diversity)
  • Optimization

“How does Facebook run inference at the edge?”

SLIDE 34

Challenges of edge inference

Mobile inference runs on old CPU cores

[Chart: CPU core design year]

SLIDE 35

Challenges of edge inference

There is no “standard” mobile SoC: the most common SoC has only 4% of the market

Mobile inference runs on old CPU cores

SLIDE 36

Challenges of edge inference

GPUs? DSPs? “Holistic” optimization?

SLIDE 37

Challenges of edge inference

GPUs? Only 20% of the mobile SoCs have a GPU 3x more powerful than the CPUs! (Apple devices stand out)

SLIDE 38

Challenges of edge inference

DSPs? Digital Signal Processors (co-processors) have little support for vector structures; programmability is an issue; only available on 5% of the SoCs

SLIDE 39

Challenges of edge inference

Performance variability

SLIDE 40

Facebook mobile inference tools

  • FBLearner workflow and optimization
  • Caffe2 runs on mobile and is designed for broad support and CNN optimization
  • A new version of PyTorch designed to accelerate AI from research to production

SLIDE 41

Facebook mobile inference tools

SLIDE 42

Facebook mobile inference tools

Caffe2 Runtime

Two in-house libraries:

  • NNPACK

○ 32-bit floating-point precision
○ Winograd transform
○ Fast Fourier transform
○ High performance for convolution

  • QNNPACK

○ 8-bit fixed-point precision
○ Augments NNPACK for low-intensity CNNs (grouped 1x1 and depthwise convolutions)

Optimized convolution for mobile CPUs
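The 8-bit fixed-point idea behind QNNPACK can be illustrated with a generic affine (scale/zero-point) quantizer. This is a hedged sketch of the general scheme, not QNNPACK's actual kernels, which are hand-optimized for mobile CPUs; the scale and zero-point values below are chosen for illustration.

```python
import numpy as np

# Affine 8-bit quantization: real value x maps to an integer q in
# [0, 255] via q = round(x / scale) + zero_point, and back via
# x_hat = (q - zero_point) * scale.
def quantize(x, scale, zero_point):
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.int32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 2.0 / 255, 128   # maps roughly [-1, 1] onto [0, 255]
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(np.max(np.abs(x - x_hat)))     # reconstruction error within one quantization step
```

Beyond the 4x memory saving over 32-bit floats, 8-bit values let convolution kernels use integer SIMD instructions, which is where the mobile-CPU speedup comes from.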

SLIDE 43

DNNs on smartphones


Performance variability

SLIDE 44

Oculus

Challenges:

  • Models must run at 30-60 FPS (instead of 10-20 FPS)
  • Multiple cameras, on the order of hundreds of inferences per second

Explored solution:

  • Offloading DNNs to DSPs (Qualcomm Hexagon)
  • Compare performance across models using DSPs and CPUs

Key DNN models:

  • Hand Tracking
  • Image Classification Model-1
  • Image Classification Model-2
  • Pose Estimation
  • Action Segmentation
SLIDE 45

Oculus: CPU vs DSP

SLIDE 46

Limitations

  • Benchmarking is difficult because of performance variability across SoCs
  • Accuracy vs model size trade-off
  • Still far from holistic system solution
SLIDE 47

Discussion

  • Performance optimization is critical for mobile inference
  • Mobile ecosystem is very fragmented (makes analysis and evaluation harder)
  • “Common denominator” solutions privilege ubiquity and programmability but sacrifice efficiency
  • Main optimization opportunities:

○ Compact image representation
○ Channel pruning
○ Quantization

  • Heterogeneity of mobile hardware/software leads to performance variability
  • Far from holistic system solutions that are generalizable
SLIDE 48

Facebook’s edge inference

PROS:
1. Detailed analysis of the current state of edge inference and its challenges
2. Mobile hardware shortcomings stimulate research in the optimization of edge inference
3. Clear directions on how to advance edge inference

CONS:
1. Despite the large amount of resources at Facebook, there is still a lot of work to do
2. Fragmentation of smartphone hardware/software means that a holistic approach to optimizing edge inference is not possible
3. Benchmarking the results from this paper is hard because "it would require a fleet of devices" (due to performance variability)
4. Even with fragmentation, “common denominator” solutions might still be possible without overly trading off efficiency

SLIDE 49

Questions?

SLIDE 50

Comparison

Facebook edge inference: edge inference challenges and proposed solutions at Facebook

Berkeley view: open research directions in systems, architectures, and security

Strategies and principles: ML systems design principles