1
ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
kaeli@ece.neu.edu
2
POPULARITY OF GPUS IN COMPUTING DOMAINS
§ GPUs are popular in two major compute domains:
§ HPC: supercomputing, cloud engines
§ SoC: low-power handheld devices
§ Growth of GPUs in HPC:
§ Avg. 10% GPU-based supercomputers every year in the Top 500 listing
§ Titan supercomputer records 27.1 PFLOPS using Nvidia K20x GPUs
§ Growth in the SoC market:
§ 2.1x increase in GPU-based SoC shipments every year
§ Projected to increase to 4x annually
Source: http://top500.org/
[Chart: GPU-based SoC shipments (in millions)]
3
PERFORMANCE: CPU VS. GPU (DP FLOPS)
Source: Karl Rupp’s website, www.karlrupp.net
4
RANGE OF APPLICATIONS FOR MODERN GPUS
Traditional Compute
- Data parallel
- Good scalability
- Example applications: FFT, matrix multiply, N-Body, convolution

Machine Learning
- Data-dependent kernel launch
- Multiple situation-based kernels
- Fragmented parallelism
- E.g., outlier detection, DNN, Hidden Markov Models

Signal Processing
- Stage-based computation kernels
- Kernel replication (channels)
- Performance bound (video)
- E.g., filtering, H.264, audio, JPEG compression, video processing

Irregular Computations
- Synchronization (sorting)
- Nested parallelism (graph search)
- E.g., graph traversal, GPU-based sorting, list intersection

System Operations and Security
- Varied kernel sizes
- Non-deterministic launch order
- Independent compute kernels with same data
- E.g., encryption, garbage collection, file-system tasks
5
KEY FEATURES REQUIRED IN GPUS TO SUPPORT MODERN APPLICATIONS
§ Collaborative Execution
ü Leverage multiple GPUs and the CPU concurrently to run a single problem
§ Applying Machine Learning approaches to tune power/performance
ü Machine learning algorithms are being used everywhere
§ Load Balancing
ü Prevent starvation when executing multiple kernels together
§ Multiple Application Hosting
ü Beneficial for cloud GPUs to mitigate user load and improve power efficiency
§ Time, Power and Performance QoS Objectives
ü Maintain deadlines for low-latency applications
6
MACHINE LEARNING BASED SCHEDULING FRAMEWORK FOR GPUS
§ GPUs are already appearing as cloud-based instances:
§ Amazon Web Services
§ Nimbix
§ Peer1 Hosting
§ IBM Cloud
§ SkyScale
§ SoftLayer
§ Co-executing applications on GPUs may interfere with each other
§ Interference leads to individual slowdown and reduced system throughput
ü Mystic: a framework for interference-aware scheduling of applications on GPU clusters/clouds using machine learning
7
INTERFERENCE ON MULTI-TASKED GPUS
§ Interference is any performance degradation caused by resource contention and conflicts between co-executing applications
§ 45 applications launched on an 8-GPU cluster using GPU remoting
§ Least Loaded (LL) and Round Robin (RR) schedulers used
§ 41% avg. slowdown over dedicated execution
§ 8 applications show 2.5x slowdown
§ Only 6 applications achieve QoS (80% of dedicated performance)
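The slowdown and QoS figures above follow from two simple definitions, sketched here with made-up runtimes (only the formulas mirror the slide):

```python
# Illustrative helpers for the slowdown/QoS metrics quoted above.
# Runtimes below are invented; the 80% QoS threshold is from the slide.

def slowdown(shared_runtime, dedicated_runtime):
    """Slowdown of an app when co-executed vs. running alone."""
    return shared_runtime / dedicated_runtime

def meets_qos(shared_runtime, dedicated_runtime, threshold=0.80):
    """QoS: achieve at least 80% of dedicated performance.
    Performance ~ 1/runtime, so compare dedicated/shared to the threshold."""
    return dedicated_runtime / shared_runtime >= threshold

print(f"slowdown: {slowdown(14.1, 10.0):.2f}x")  # slowdown: 1.41x
print(meets_qos(12.0, 10.0))                     # True  (10/12 ~ 0.83)
print(meets_qos(15.0, 10.0))                     # False (10/15 ~ 0.67)
```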
8
MYSTIC: MACHINE LEARNING BASED INTERFERENCE DETECTION
§ Mystic is a layer implemented on the head node of a cluster
§ Stage-1
§ Initialize Mystic
§ Create a status entry for incoming applications
§ Collect short, limited profiles for new apps
§ Stage-2
§ Predict missing metrics for the short profiles using Collaborative Filtering with SVD and the training matrix
§ Fill the PRT with predicted performance values
§ Stage-3
§ Detect similarity between current and existing applications using the PRT and MAST
§ Generate interference scores
§ Select co-execution candidates
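Stage-2's prediction step can be sketched as an SVD fold-in: project the partial profile of a new app onto the latent space learned from the training matrix, then reconstruct the missing metrics. This is a minimal illustration with a synthetic low-rank training matrix; Mystic's actual TRM, metric set, and rank are not shown here, so the dimensions and observed/missing split below are assumptions.

```python
import numpy as np

# Synthetic stand-in for the training matrix (apps x metrics); Mystic's
# real TRM holds performance-counter profiles of previously seen apps.
rng = np.random.default_rng(0)
n_apps, n_metrics, k = 40, 10, 3
A = rng.random((n_apps, k))
B = rng.random((k, n_metrics))
TRM = A @ B                          # rank-k by construction

# Truncated SVD of the training matrix: keep the top-k metric factors.
U, S, Vt = np.linalg.svd(TRM, full_matrices=False)
Vk = Vt[:k]                          # k x n_metrics latent factors

# New app: only the first 4 metrics were captured in the short profile.
true_profile = rng.random(k) @ B
observed = np.arange(4)

# Fold the partial profile into the latent space (least squares),
# then reconstruct the full metric vector, including missing entries.
w, *_ = np.linalg.lstsq(Vk[:, observed].T, true_profile[observed], rcond=None)
predicted = w @ Vk

# Near zero for this low-rank example: the missing metrics are recovered.
print(float(np.max(np.abs(predicted - true_profile))))
```

The fold-in needs at least as many observed metrics as latent factors (here 4 >= 3); with fewer, the reconstruction is under-determined.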
9
EVALUATION METHODOLOGY
§ Mystic evaluated on a private cluster running rCUDA [UPV]
§ Node configuration
§ Xeon E5-2695 CPU
§ 2 Nvidia K40m GPUs
§ 24GB/16GB memory (head/compute)
§ InfiniBand FDR interconnect
§ 12GB GPU DRAM per device
§ 42 applications for training matrix (TRM) creation
§ 55 applications for testing Mystic
§ 100 random launch sequences
Suite           # Apps  Applications
PolyBench-GPU   14      2dConv, 3dConv, 3mm, Atax, Bicg, Gemm, Gesummv, Gramschmidt, mvt, Syr2k, Syrk, Correlation, Covariance, FDTD-2d
SHOC-GPU        12      BFS-Shoc, FFT, MD, MD5Hash, NeuralNet, Scan, Sort, Reduction, Spmv, Stencil-2D, Triad, QT-Clustering
Lonestar-GPU    6       BH N-body, BFS, LS, MST, DMR, SP, Sssp
NUPAR           5       CCL, LSS, HMM, LoKDR, IIR
OpenFOAM        8       Dsmc, PDR, thermo, buoyantS, mhd, simpleFoam, sprayFoam, driftFlux
ASC Proxy Apps  4       CoMD, LULESH, miniFE, miniMD
Caffe-cuDNN     3       LeNet, logReg, fineTune
MAGMA           3       Dasaxpycp, strstm, dgesv
10
MYSTIC SCHEDULER PERFORMANCE
- 90.2% of applications achieve QoS with Mystic
- 75% of applications show less than 15% degradation with Mystic
- RR achieves QoS for only 21% of applications
- LL experiences severe degradation: short-running interfering apps are co-scheduled
- Mystic scales effectively as we increase the number of nodes
- LL utilizes GPUs more effectively than RR for fewer nodes
11
SUMMARIZING THE MYSTIC SCHEDULER
§ Mystic is an interference-aware scheduler for GPU clusters
§ Collaborative filtering used to predict the performance of incoming applications
§ Mystic assigns applications for co-execution such that interference is minimized
§ Evaluated Mystic on a 16-node cluster with 55 applications and 100 launch sequences

“Mystic: Predictive Scheduling for GPU based Cloud Servers Using Machine Learning,” U. Ukidave, X. Li and D. Kaeli, 30th IEEE International Parallel and Distributed Processing Symposium, May 2016.
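Given interference scores, co-execution candidates can be picked greedily. A hedged sketch of that idea, using cosine similarity of resource-usage profiles as a stand-in interference proxy (the profiles and scoring below are illustrative, not Mystic's actual metrics):

```python
import math

def cosine(a, b):
    """Cosine similarity between two resource-usage profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pick_gpu(incoming, gpus):
    """Place the incoming app on the GPU whose resident apps overlap
    least with it in resource usage. gpus: per-GPU lists of profiles."""
    def worst_overlap(residents):
        if not residents:
            return 0.0                  # empty GPU: no interference
        return max(cosine(incoming, r) for r in residents)
    return min(range(len(gpus)), key=lambda i: worst_overlap(gpus[i]))

# Hypothetical [compute, mem-bandwidth, cache] profiles.
incoming = [0.1, 0.9, 0.2]              # memory-bound newcomer
gpus = [[[0.2, 0.8, 0.3]],              # GPU 0 hosts a memory-bound app
        [[0.9, 0.1, 0.4]]]              # GPU 1 hosts a compute-bound app
print(pick_gpu(incoming, gpus))         # -> 1 (less resource overlap)
```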
12
AIRAVAT: IMPROVING THE ENERGY EFFICIENCY OF HETEROGENEOUS SYSTEMS/APPLICATIONS
§ New platforms can support concurrent CPU and GPU execution
§ Peak performance/power can be achieved when both devices are used collaboratively
§ Emerging heterogeneous platforms (APUs, Jetson, etc.) have the CPU and GPU share a single memory
§ Multiple/shared clock domains for CPU, GPU and memory
§ Airavat provides a framework that can improve the Energy-Delay Product on NVIDIA Jetson GPUs, providing significant power/performance benefits compared to the NVIDIA baseline power manager
13
AIRAVAT: 2-LEVEL POWER MANAGEMENT FRAMEWORK
Frequency Approximation Layer (FAL)
- Uses Random Forest based prediction to select the best frequency tuple for the application

Collaborative Tuning Layer (CTL)
- Leverages run-time feedback based on performance counters to fine-tune the CPU/GPU running collaboratively
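The CTL's feedback idea can be sketched as a simple control loop: raise a device's clock when it is the bottleneck, lower it when it idles. The frequency tables, utilization thresholds, and sampled counters below are all hypothetical, and the FAL's random-forest predictor is stubbed out as a fixed starting tuple:

```python
# Minimal sketch of a CTL-style feedback loop (illustrative values only).
CPU_FREQS = [345, 499, 652, 806, 960]   # MHz, hypothetical DVFS table
GPU_FREQS = [76, 153, 230, 307, 384]    # MHz, hypothetical DVFS table

def tune_step(cpu_idx, gpu_idx, cpu_util, gpu_util, hi=0.85, lo=0.40):
    """One tuning iteration: step a device's frequency up when its
    utilization is high, down when it is low, else leave it alone."""
    if cpu_util > hi:
        cpu_idx = min(cpu_idx + 1, len(CPU_FREQS) - 1)
    elif cpu_util < lo:
        cpu_idx = max(cpu_idx - 1, 0)
    if gpu_util > hi:
        gpu_idx = min(gpu_idx + 1, len(GPU_FREQS) - 1)
    elif gpu_util < lo:
        gpu_idx = max(gpu_idx - 1, 0)
    return cpu_idx, gpu_idx

# FAL stub: start mid-table; CTL then reacts to sampled utilizations.
cpu, gpu = 2, 2
for cpu_util, gpu_util in [(0.95, 0.30), (0.90, 0.35), (0.60, 0.50)]:
    cpu, gpu = tune_step(cpu, gpu, cpu_util, gpu_util)
print(CPU_FREQS[cpu], GPU_FREQS[gpu])   # prints "960 76" for these samples
```

A CPU-bound, GPU-light phase like the one sampled above drives the CPU clock up and the GPU clock down, which is the collaborative-tuning behavior the slide describes.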
14
SUMMARY OF AIRAVAT
- Average EDP reduction is 24%, and can be as high as 70%, compared to the baseline power manager
- FAL provides up to 19% energy savings, with an additional 5% from CTL
- Airavat achieves an application speedup of 1.24x
- The EDP difference compared to an oracle power manager is 7%
- Airavat draws 4% more power, since it trades higher power for better energy and performance
“Airavat: Improving Energy Efficiency of Heterogeneous Applications,” accepted to DATE 2018.
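The EDP figures above combine power and runtime as EDP = energy x delay = P*T^2, so a speedup helps quadratically while extra power costs only linearly. A small arithmetic check with an invented baseline, plugging in the 1.24x speedup and ~4% extra power quoted above (the resulting reduction is illustrative, not the measured 24% average):

```python
# EDP = energy * delay = (P * T) * T = P * T^2.
def edp(power_w, runtime_s):
    energy_j = power_w * runtime_s      # Joules
    return energy_j * runtime_s         # Joule-seconds

# Invented baseline: 10 W for 100 s under the stock power manager.
base = edp(power_w=10.0, runtime_s=100.0)
# Tuned run: 1.24x speedup at the cost of ~4% higher power draw.
tuned = edp(power_w=10.4, runtime_s=100.0 / 1.24)

# EDP ratio is (P'/P) * (T'/T)^2 = 1.04 / 1.24^2 ~ 0.68,
# i.e. roughly a one-third EDP reduction for these invented numbers.
print(f"EDP reduction: {1 - tuned / base:.1%}")
```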
15
OTHER AREAS OF ONGOING RESEARCH IN NUCAR
§ Multi2Sim simulator – Kepler CPU/GPU timing simulator – ISPASS’17, HSAsim ongoing
§ Multi-GPU and GPU-based NoC designs – HiPEAC’17
§ NICE – Northeastern Interactive Clustering Engine – NIPS’17
§ PCM reliability – DATE’17, DSN’17 – with Penn St.
§ GPU reliability – SELSE’16, SELSE’17, MWCSC’17, DATE’18
§ GPU Hardware Transactional Memory – EuroPar’17, IEEE TC – with U. of Malaga
§ GPU compilation – CGO’17, ongoing with U. of Ghent
§ GPU scalar execution – ISPASS’13, IPDPS’14, IPDPS’16
§ GPU recursive execution on graphs – APG’17
16
THE NORTHEASTERN UNIVERSITY COMPUTER ARCHITECTURE RESEARCH LAB
17