Overviews and practical reports
Justin Clarke, Cecilia Ferrando, William Rebelsky
Trends
The growing amount of data and the need for machine learning data processing challenge us to advance systems. One trillion IoT devices expected by
Source: https://mkomo.com/cost-per-gigabyte-update
Source: Forbes.com
ML researchers should be involved in system design.
Four key questions:
1. How can an ML program be distributed over a cluster?
2. How can ML computation be bridged with inter-machine communication?
3. How can such communication be performed?
4. What should be communicated between machines?
○ Infinitely fast networks
○ All machines process at the same rate
○ No additional users/background tasks
Two ways to speed up distributed ML:
○ Improve convergence rate (fewer iterations to reach a given accuracy)
○ Improve throughput (shorter per-iteration time)
Machine Learning programs:
○ ML programs are robust to minor errors in intermediate steps
○ Parameters depend not only on the data but also on each other
○ Not all parameters converge in the same number of iterations
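The error-tolerance property can be illustrated with a toy sketch (not from the paper): gradient descent on f(x) = x² still reaches the optimum even when every gradient is corrupted by bounded noise.

```python
import random

def noisy_gd(steps=200, lr=0.1, noise=0.05, seed=0):
    """Minimize f(x) = x^2 using gradients corrupted by bounded noise."""
    rng = random.Random(seed)
    x = 5.0
    for _ in range(steps):
        grad = 2 * x + rng.uniform(-noise, noise)  # erroneous intermediate step
        x -= lr * grad
    return x

# Despite per-step errors, the iterate still lands near the optimum x* = 0.
assert abs(noisy_gd()) < 0.1
```

This tolerance is what lets distributed ML systems relax strict consistency in exchange for throughput.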
Traditional programs:
○ Every intermediate step must be atomically correct
○ Speed of execution
○ Ease of programmability
○ Correctness of solution
Atomic correctness of every intermediate step is not necessary in ML
Petuum provides:
○ A ready-to-run set of ML-workhorse implementations (e.g., MCMC)
○ An ML distributed cluster OS that supports the above implementations
○ Compute prioritization of parameters (e.g., give more resources to the parameters that need them)
○ Workload balancing using slow-worker agnosticism
○ Structure-Aware Parallelization (SAP) for scheduling, prioritization, and load balancing
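A minimal sketch of the prioritization idea (a toy illustration, not Petuum's actual scheduler): each iteration, spend the update budget on the k parameters whose gradients are currently largest, so slow-converging parameters receive more work as fast ones finish.

```python
def prioritized_descent(curvatures, k=2, steps=400, lr=0.1):
    """Toy prioritization: only the k parameters with the largest
    gradient magnitudes are updated in each iteration."""
    params = [1.0] * len(curvatures)
    for _ in range(steps):
        grads = [2 * c * p for c, p in zip(curvatures, params)]
        # give resources to the parameters that currently need them most
        order = sorted(range(len(params)), key=lambda i: -abs(grads[i]))
        for i in order[:k]:
            params[i] -= lr * grads[i]
    return params

# Parameters with very different convergence speeds all reach the optimum.
params = prioritized_descent([5.0, 1.0, 0.5, 0.1])
assert all(abs(p) < 0.05 for p in params)
```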
Bulk Synchronous Parallel (BSP):
○ Workers wait at the end of each iteration until everyone is finished
○ Issue: don’t get the p-fold speedup
■ The synchronization barrier suffers from stragglers
■ The synchronization barrier can take longer than the iteration itself
Asynchronous execution:
○ Workers continue iterating and sending updates without waiting for others to finish
○ Issue: less progress per iteration
■ Information becomes stale
■ In the limit, errors can cause slow or incorrect convergence
Stale Synchronous Parallel (SSP):
○ Workers who get more than s iterations ahead of any other worker are stopped
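A toy simulation of this stale-synchronous idea (a sketch, not the actual implementation): workers advance at different speeds, and any worker more than s iterations ahead of the slowest one is blocked until the slowest catches up.

```python
def ssp_run(speeds, total_iters=20, staleness=2):
    """Each tick, a worker makes `speed` progress on its current iteration,
    but only if it is at most `staleness` iterations ahead of the slowest."""
    clock = [0] * len(speeds)       # completed iterations per worker
    progress = [0.0] * len(speeds)  # partial work on the current iteration
    while min(clock) < total_iters:
        for w, speed in enumerate(speeds):
            if clock[w] - min(clock) >= staleness:
                continue  # blocked at the staleness bound
            progress[w] += speed
            if progress[w] >= 1.0:
                progress[w] = 0.0
                clock[w] += 1
    return clock

# Fast workers never run more than `staleness` iterations ahead.
clocks = ssp_run([1.0, 0.5, 0.25])
assert max(clocks) - min(clocks) <= 2
```

Setting the staleness to 0 recovers BSP-style lockstep, while a very large staleness approaches fully asynchronous execution.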
○ Take advantage of the idea that in fully connected layers the top layers account for 90% of the parameters, but only 10% of the backpropagation cost
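A back-of-the-envelope sketch (with hypothetical, roughly AlexNet-like layer sizes) of why fully connected layers dominate parameter counts but not compute:

```python
def conv_stats(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulates for a k x k convolution layer."""
    params = c_in * c_out * k * k
    macs = params * h_out * w_out  # weights reused at every output position
    return params, macs

def fc_stats(n_in, n_out):
    """Parameters and MACs for a fully connected layer."""
    params = n_in * n_out
    macs = params  # each weight used exactly once per input
    return params, macs

conv_p, conv_m = conv_stats(256, 256, 3, 13, 13)  # a late conv layer
fc_p, fc_m = fc_stats(9216, 4096)                 # a first fc layer

# The fc layer has ~64x the parameters, yet fewer MACs than the conv layer.
assert fc_p > 50 * conv_p
assert conv_m > fc_m
```

This asymmetry is what makes it attractive to treat communication of the top (fully connected) layers differently from the convolutional backpropagation work.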
○ Naive synchronization is not instantaneous
○ Gradient updates can be decomposed so that only S(K+D) elements are transmitted instead of K·D
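The decomposition works because, for many models, the gradient of a K×D weight matrix on one sample is a rank-1 outer product u·vᵀ; sending the two vectors (K+D numbers) lets the receiver rebuild all K·D entries. A minimal sketch with illustrative values:

```python
def outer(u, v):
    """Receiver-side reconstruction of the K x D gradient from its factors."""
    return [[ui * vj for vj in v] for ui in u]

K, D = 4, 6
u = [0.5, -1.0, 2.0, 0.25]  # e.g. per-output error signal (length K)
v = [1.0, 2, 3, 4, 5, 6]    # e.g. input features (length D)

gradient = outer(u, v)

# Transmitting the factors costs K + D numbers instead of K * D.
assert len(u) + len(v) == 10                 # what is sent
assert sum(len(r) for r in gradient) == 24   # what is reconstructed
assert gradient[2][4] == u[2] * v[4]
```

With S samples per communication round, S(K+D) values are sent instead of K·D, a large saving whenever S(K+D) ≪ K·D.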
PROS:
1. Enough relevant background to explain the issues to people not already in the field
2. Strong justification for ML researchers being involved in designing the systems
3. Separating the issues into 4 major questions allows more directed research moving forward
CONS:
1. Section 4: Petuum
   a. Claims close to p-fold speedup but doesn’t show data
   b. Basic implementation that “might become the foundation of an ML distributed cluster operating system”
2. Inconsistent specificity: the ML models are stated carefully, but the solutions only in general terms
   a. “Continuous communication can be achieved by a rate limiter in the SSP implementation”
   b. “SSP with properly selected staleness values”
(Diagram: machine learning tasks and infrastructure, from datacenters to the edge)
Inference on the edge to avoid latency
“HOW DOES FACEBOOK RUN INFERENCE AT THE EDGE?”
EDGE HARDWARE LIMITATIONS (low performance)
SOFTWARE LIMITATIONS (diversity)
OPTIMIZATION
Mobile inference runs on old CPU cores (chart: distribution of CPU core design years)
There is no “standard” mobile SoC The most common SoC has
GPUs? “Holistic” optimization? DSPs?
GPUs? Only 20% of mobile SoCs have a GPU 3x more powerful than their CPUs (Apple devices stand out)
DSPs? Digital Signal Processors (co-processors) have little support for vector structures; programmability is an issue; only available on 5% of SoCs
Performance variability
Caffe2: broad support and CNN optimization; accelerates AI from research to production
Caffe2 Runtime
Two in-house libraries:
NNPACK:
○ 32-bit floating-point precision
○ Winograd transform
○ Fast Fourier transform
○ High performance for convolution
QNNPACK:
○ 8-bit fixed-point precision
○ Augments NNPACK for low-intensity CNNs (grouped 1x1 and depthwise convolution)
Optimized convolution for mobile CPUs
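As a sketch of why the Winograd transform helps, here is the 1D case F(2,3) (Lavin–Gray minimal filtering): two outputs of a 3-tap convolution computed with 4 multiplications instead of 6. Libraries such as NNPACK apply the 2D analogue to 3x3 convolutions.

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap correlation with 4 multiplications."""
    # transform the filter (can be precomputed once per filter)
    gt = [g[0],
          (g[0] + g[1] + g[2]) / 2,
          (g[0] - g[1] + g[2]) / 2,
          g[2]]
    # transform the 4-sample input tile
    dt = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    m = [a * b for a, b in zip(gt, dt)]  # only 4 multiplications
    # inverse transform back to two outputs
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct(d, g):
    """Direct 3-tap correlation: 6 multiplications for two outputs."""
    return [sum(g[k] * d[i + k] for k in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
assert all(abs(a - b) < 1e-9 for a, b in zip(winograd_f23(d, g), direct(d, g)))
```

The filter transform is amortized across the whole image, and the 2D version F(2x2, 3x3) reduces multiplications per tile from 36 to 16, a 2.25x arithmetic saving.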
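The 8-bit fixed-point scheme can be sketched as affine quantization (a generic illustration of the approach, not any library's exact kernels): real values are mapped to unsigned 8-bit integers via a scale and a zero point.

```python
def quantize(xs, num_bits=8):
    """Map floats to unsigned num_bits integers with a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(min(xs), 0.0), max(max(xs), 0.0)  # range must include 0
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(-lo / scale)  # the integer that represents real 0.0
    q = [min(qmax, max(qmin, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, -0.5, 0.0, 0.25, 1.5]
q, scale, zp = quantize(vals)
restored = dequantize(q, scale, zp)

# All codes fit in 8 bits and the round-trip error is within one step.
assert all(0 <= qi <= 255 for qi in q)
assert all(abs(a - b) <= scale for a, b in zip(vals, restored))
```

Storing weights and activations in 8 bits cuts memory traffic roughly 4x versus float32 and maps onto fast integer SIMD instructions on mobile CPUs.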
Performance variability
Challenges:
10-20 FPS)
inferences per second
Explored solution:
Hexagon)
DSPs and CPUs
Key DNN models:
○ Compact image representation
○ Channel pruning
○ Quantization
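Channel pruning can be sketched as magnitude-based filter selection (a toy illustration; real criteria vary): rank a layer's output channels by the L1 norm of their weights and drop the weakest ones.

```python
def prune_channels(filters, keep_ratio=0.5):
    """Keep the output channels whose weights have the largest L1 norms."""
    norms = [sum(abs(w) for w in f) for f in filters]
    k = max(1, int(len(filters) * keep_ratio))
    keep = sorted(range(len(filters)), key=lambda i: -norms[i])[:k]
    return sorted(keep)

# Four output channels with flattened weights; two are near zero.
filters = [
    [0.9, -0.8, 0.7],    # strong channel
    [0.01, 0.02, 0.0],   # weak channel -> pruned
    [0.5, 0.4, -0.6],    # strong channel
    [0.03, -0.01, 0.0],  # weak channel -> pruned
]
assert prune_channels(filters) == [0, 2]
```

Removing a channel shrinks both the model and the activation tensor fed to the next layer, which compounds the savings.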
PROS:
1. Detailed analysis of the current state of edge inference and its challenges
2. Mobile hardware shortcomings stimulate research in the optimization of edge inference
3. Clear directions on how to advance edge inference
CONS:
1. Despite the large amount of resources at Facebook, there is still a lot of work to do
2. Fragmentation of smartphone hardware/software means a holistic approach to optimizing edge inference is not possible
3. Benchmarking against this paper’s results is hard because “it would require a fleet of devices” (due to performance variability)
4. Even with fragmentation, “common denominator” solutions might still be possible without overly trading off efficiency
Facebook edge inference: edge inference challenges and proposed solutions at Facebook
Berkeley view: directions in systems, architectures, and security
Strategies and principles: ML systems design principles