1
ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
kaeli@ece.neu.edu
2
POPULARITY OF GPUS IN COMPUTING DOMAINS
§ GPUs are popular in two major compute domains:
§ HPC: supercomputing, cloud engines
§ SoC: low-power handheld devices
§ Growth of GPUs in HPC:
§ Avg. 10% GPU-based supercomputers every year in the Top 500 listing
§ Titan supercomputer records 27.1 PFLOPS using Nvidia K20x GPUs
§ Growth in the SoC market:
§ 2.1x increase in GPU-based SoC shipments every year
§ Projected to increase to 4x annually
Source: http://top500.org/
[Chart: GPU-based SoC shipments (in millions)]
3
PERFORMANCE: CPU VS. GPU (DP FLOPS)
Source: Karl Rupp’s website, www.karlrupp.net
4
RANGE OF APPLICATIONS FOR MODERN GPUS
Traditional Compute
- Data parallel
- Good scalability
- Example applications: FFT, matrix multiply, N-Body, convolution

Machine Learning
- Data-dependent kernel launch
- Multiple situation-based kernels
- Fragmented parallelism
- E.g., outlier detection, DNN, Hidden Markov Models

Signal Processing
- Stage-based computation kernels
- Kernel replication (channels)
- Performance bound (video)
- E.g., filtering, H.264, audio, JPEG compression, video processing

Irregular Computations
- Synchronization (sorting)
- Nested parallelism (graph search)
- E.g., graph traversal, GPU-based sorting, list intersection

System Operations and Security
- Varied kernel sizes
- Non-deterministic launch order
- Independent compute kernels with same data
- E.g., encryption, garbage collection, file-system tasks
5
KEY FEATURES REQUIRED IN GPUS TO SUPPORT MODERN APPLICATIONS
§ Collaborative Execution
ü Leverage multiple GPUs and the CPU concurrently to run a single problem
§ Applying Machine Learning approaches to tune power/performance
ü Machine learning algorithms are being used everywhere
§ Load Balancing
ü Prevent starvation when executing multiple kernels together
§ Multiple Application Hosting
ü Beneficial for cloud GPUs to mitigate user load and improve power efficiency
§ Time, Power and Performance QoS Objectives
ü Maintain deadlines for low-latency applications
6
MACHINE LEARNING BASED SCHEDULING FRAMEWORK FOR GPUS
§ GPUs are already appearing as cloud-based instances:
§ Amazon Web Services
§ Nimbix
§ Peer1 Hosting
§ IBM Cloud
§ SkyScale
§ SoftLayer
§ Co-executing applications on GPUs may interfere with each other
§ Interference leads to individual slowdown and reduced system throughput
ü Mystic: a framework for interference-aware scheduling of applications on GPU clusters/clouds using machine learning
7
INTERFERENCE ON MULTI-TASKED GPUS
§ Interference is any performance degradation caused by resource contention and conflicts between co-executing applications
§ 45 applications launched on an 8-GPU cluster using GPU remoting
§ Least Loaded (LL) and Round Robin (RR) schedulers used
§ 41% avg. slowdown over dedicated execution
§ 8 applications show 2.5x slowdown
§ Only 6 applications achieve QoS (80% of dedicated performance)
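The slowdown and QoS figures above follow from two simple definitions, sketched here with made-up runtimes (only the formulas mirror the slide):

```python
# Illustrative helpers for the slowdown/QoS metrics quoted above.
# Runtimes below are invented; the 80% QoS threshold is from the slide.

def slowdown(shared_runtime, dedicated_runtime):
    """Slowdown of an app when co-executed vs. running alone."""
    return shared_runtime / dedicated_runtime

def meets_qos(shared_runtime, dedicated_runtime, threshold=0.80):
    """QoS: achieve at least 80% of dedicated performance.
    Performance ~ 1/runtime, so compare dedicated/shared to the threshold."""
    return dedicated_runtime / shared_runtime >= threshold

print(f"slowdown: {slowdown(14.1, 10.0):.2f}x")  # slowdown: 1.41x
print(meets_qos(12.0, 10.0))                     # True  (10/12 ~ 0.83)
print(meets_qos(15.0, 10.0))                     # False (10/15 ~ 0.67)
```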
8
MYSTIC: MACHINE LEARNING BASED INTERFERENCE DETECTION
§ Mystic is a layer implemented on the head node of a cluster
§ Stage-1
§ Initialize Mystic
§ Create a status entry for incoming applications
§ Collect short, limited profiles for new apps
§ Stage-2
§ Predict missing metrics for the short profiles using Collaborative Filtering with SVD and the training matrix
§ Fill the PRT with predicted performance values
§ Stage-3
§ Detect similarity between current and existing applications using the PRT and MAST
§ Generate interference scores
§ Select co-execution candidates
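Stage-2's prediction step can be sketched as an SVD fold-in: project the partial profile of a new app onto the latent space learned from the training matrix, then reconstruct the missing metrics. This is a minimal illustration with a synthetic low-rank training matrix; Mystic's actual TRM, metric set, and rank are not shown here, so the dimensions and observed/missing split below are assumptions.

```python
import numpy as np

# Synthetic stand-in for the training matrix (apps x metrics); Mystic's
# real TRM holds performance-counter profiles of previously seen apps.
rng = np.random.default_rng(0)
n_apps, n_metrics, k = 40, 10, 3
A = rng.random((n_apps, k))
B = rng.random((k, n_metrics))
TRM = A @ B                          # rank-k by construction

# Truncated SVD of the training matrix: keep the top-k metric factors.
U, S, Vt = np.linalg.svd(TRM, full_matrices=False)
Vk = Vt[:k]                          # k x n_metrics latent factors

# New app: only the first 4 metrics were captured in the short profile.
true_profile = rng.random(k) @ B
observed = np.arange(4)

# Fold the partial profile into the latent space (least squares),
# then reconstruct the full metric vector, including missing entries.
w, *_ = np.linalg.lstsq(Vk[:, observed].T, true_profile[observed], rcond=None)
predicted = w @ Vk

# Near zero for this low-rank example: the missing metrics are recovered.
print(float(np.max(np.abs(predicted - true_profile))))
```

The fold-in needs at least as many observed metrics as latent factors (here 4 >= 3); with fewer, the reconstruction is under-determined.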
9
EVALUATION METHODOLOGY
§ Mystic evaluated on a private cluster running rCUDA [UPV]
§ Node configuration
§ Xeon E5-2695 CPU
§ 2 Nvidia K40m GPUs
§ 24GB/16GB memory (head/compute)
§ InfiniBand FDR interconnect
§ 12GB GPU DRAM per device
§ 42 applications for training matrix (TRM) creation
§ 55 applications for testing Mystic
§ 100 random launch sequences
Suite           # Apps  Applications
PolyBench-GPU   14      2dConv, 3dConv, 3mm, Atax, Bicg, Gemm, Gesummv, Gramschmidt, mvt, Syr2k, Syrk, Correlation, Covariance, FDTD-2d
SHOC-GPU        12      BFS-Shoc, FFT, MD, MD5Hash, NeuralNet, Scan, Sort, Reduction, Spmv, Stencil-2D, Triad, QT-Clustering
Lonestar-GPU    6       BH N-body, BFS, LS, MST, DMR, SP, Sssp
NUPAR           5       CCL, LSS, HMM, LoKDR, IIR
OpenFOAM        8       Dsmc, PDR, thermo, buoyantS, mhd, simpleFoam, sprayFoam, driftFlux
ASC Proxy Apps  4       CoMD, LULESH, miniFE, miniMD
Caffe-cuDNN     3       LeNet, logReg, fineTune
MAGMA           3       Dasaxpycp, strstm, dgesv
10
MYSTIC SCHEDULER PERFORMANCE
- 90.2% of applications achieve QoS with Mystic
- 75% of applications show less than 15% degradation with Mystic
- RR achieves QoS for only 21% of applications
- LL experiences severe degradation: short-running interfering apps are co-scheduled
- Mystic scales effectively as we increase the number of nodes
- LL utilizes GPUs more effectively than RR for fewer nodes
11
SUMMARIZING THE MYSTIC SCHEDULER
§ Mystic is an interference-aware scheduler for GPU clusters
§ Collaborative filtering used to predict the performance of incoming applications
§ Mystic assigns applications for co-execution such that interference is minimized
§ Evaluated Mystic on a 16-node cluster with 55 applications and 100 launch sequences

“Mystic: Predictive Scheduling for GPU based Cloud Servers Using Machine Learning,” U. Ukidave, X. Li and D. Kaeli, 30th IEEE International Parallel and Distributed Processing Symposium, May 2016.
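Given interference scores, co-execution candidates can be picked greedily. A hedged sketch of that idea, using cosine similarity of resource-usage profiles as a stand-in interference proxy (the profiles and scoring below are illustrative, not Mystic's actual metrics):

```python
import math

def cosine(a, b):
    """Cosine similarity between two resource-usage profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pick_gpu(incoming, gpus):
    """Place the incoming app on the GPU whose resident apps overlap
    least with it in resource usage. gpus: per-GPU lists of profiles."""
    def worst_overlap(residents):
        if not residents:
            return 0.0                  # empty GPU: no interference
        return max(cosine(incoming, r) for r in residents)
    return min(range(len(gpus)), key=lambda i: worst_overlap(gpus[i]))

# Hypothetical [compute, mem-bandwidth, cache] profiles.
incoming = [0.1, 0.9, 0.2]              # memory-bound newcomer
gpus = [[[0.2, 0.8, 0.3]],              # GPU 0 hosts a memory-bound app
        [[0.9, 0.1, 0.4]]]              # GPU 1 hosts a compute-bound app
print(pick_gpu(incoming, gpus))         # -> 1 (less resource overlap)
```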
12
AIRAVAT: IMPROVING THE ENERGY EFFICIENCY OF HETEROGENEOUS SYSTEMS/APPLICATIONS
§ New platforms can support concurrent CPU and GPU execution
§ Peak performance/power can be achieved when both devices are used collaboratively
§ Emerging heterogeneous platforms (APUs, Jetson, etc.) have the CPU and GPU share a single memory
§ Multiple/shared clock domains for CPU, GPU and memory
§ Airavat provides a framework that can improve the Energy-Delay Product on NVIDIA Jetson GPUs, providing significant power/performance benefits compared to the NVIDIA baseline power manager
13
AIRAVAT: 2-LEVEL POWER MANAGEMENT FRAMEWORK
Frequency Approximation Layer (FAL)
- Uses Random Forest based prediction to select the best frequency tuple for the application

Collaborative Tuning Layer (CTL)
- Leverages run-time feedback based on performance counters to fine-tune the CPU/GPU running collaboratively
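The CTL's feedback idea can be sketched as a simple control loop: raise a device's clock when it is the bottleneck, lower it when it idles. The frequency tables, utilization thresholds, and sampled counters below are all hypothetical, and the FAL's random-forest predictor is stubbed out as a fixed starting tuple:

```python
# Minimal sketch of a CTL-style feedback loop (illustrative values only).
CPU_FREQS = [345, 499, 652, 806, 960]   # MHz, hypothetical DVFS table
GPU_FREQS = [76, 153, 230, 307, 384]    # MHz, hypothetical DVFS table

def tune_step(cpu_idx, gpu_idx, cpu_util, gpu_util, hi=0.85, lo=0.40):
    """One tuning iteration: step a device's frequency up when its
    utilization is high, down when it is low, else leave it alone."""
    if cpu_util > hi:
        cpu_idx = min(cpu_idx + 1, len(CPU_FREQS) - 1)
    elif cpu_util < lo:
        cpu_idx = max(cpu_idx - 1, 0)
    if gpu_util > hi:
        gpu_idx = min(gpu_idx + 1, len(GPU_FREQS) - 1)
    elif gpu_util < lo:
        gpu_idx = max(gpu_idx - 1, 0)
    return cpu_idx, gpu_idx

# FAL stub: start mid-table; CTL then reacts to sampled utilizations.
cpu, gpu = 2, 2
for cpu_util, gpu_util in [(0.95, 0.30), (0.90, 0.35), (0.60, 0.50)]:
    cpu, gpu = tune_step(cpu, gpu, cpu_util, gpu_util)
print(CPU_FREQS[cpu], GPU_FREQS[gpu])   # prints "960 76" for these samples
```

A CPU-bound, GPU-light phase like the one sampled above drives the CPU clock up and the GPU clock down, which is the collaborative-tuning behavior the slide describes.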
14
SUMMARY OF AIRAVAT
- Average EDP reduction is 24%, and can be as high as 70%, compared to the baseline power manager
- FAL provides up to 19% energy savings, with an additional 5% from CTL
- Airavat achieves an application speedup of 1.24x
- The EDP difference compared to an oracle power manager is 7%
- Airavat draws 4% more power, since it trades higher power for better energy and performance
“Airavat: Improving Energy Efficiency of Heterogeneous Applications,” accepted to DATE 2018.
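The EDP figures above combine power and runtime as EDP = energy x delay = P*T^2, so a speedup helps quadratically while extra power costs only linearly. A small arithmetic check with an invented baseline, plugging in the 1.24x speedup and ~4% extra power quoted above (the resulting reduction is illustrative, not the measured 24% average):

```python
# EDP = energy * delay = (P * T) * T = P * T^2.
def edp(power_w, runtime_s):
    energy_j = power_w * runtime_s      # Joules
    return energy_j * runtime_s         # Joule-seconds

# Invented baseline: 10 W for 100 s under the stock power manager.
base = edp(power_w=10.0, runtime_s=100.0)
# Tuned run: 1.24x speedup at the cost of ~4% higher power draw.
tuned = edp(power_w=10.4, runtime_s=100.0 / 1.24)

# EDP ratio is (P'/P) * (T'/T)^2 = 1.04 / 1.24^2 ~ 0.68,
# i.e. roughly a one-third EDP reduction for these invented numbers.
print(f"EDP reduction: {1 - tuned / base:.1%}")
```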
15
OTHER AREAS OF ONGOING RESEARCH IN NUCAR
§ Multi2Sim simulator – Kepler CPU/GPU timing simulator – ISPASS’17, HSAsim ongoing
§ Multi-GPU and GPU-based NoC designs – HiPEAC’17
§ NICE – Northeastern Interactive Clustering Engine – NIPS’17
§ PCM reliability – DATE’17, DSN’17 – with Penn St.
§ GPU reliability – SELSE’16, SELSE’17, MWCSC’17, DATE’18
§ GPU Hardware Transactional Memory – EuroPar’17, IEEE TC – with U. of Malaga
§ GPU compilation – CGO’17, ongoing with U. of Ghent
§ GPU scalar execution – ISPASS’13, IPDPS’14, IPDPS’16
§ GPU recursive execution on graphs – APG’17
16
THE NORTHEASTERN UNIVERSITY COMPUTER ARCHITECTURE RESEARCH LAB
17