ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM

SLIDE 1

ACCELERATOR SCHEDULING AND MANAGEMENT USING A RECOMMENDATION SYSTEM

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli@ece.neu.edu

SLIDE 2

POPULARITY OF GPUS IN COMPUTING DOMAINS

  • GPUs are popular in two major compute domains:
      • HPC: supercomputing, cloud engines
      • SoC: low-power handheld devices
  • Growth of GPUs in HPC:
      • Avg. 10% GPU-based supercomputers every year in the Top 500 listing [1]
      • Titan supercomputer records 27.1 PFLOPS using Nvidia K20x GPUs
  • Growth in the SoC market:
      • 2.1x increase in GPU-based SoC shipments every year
      • Projected to increase to 4x annually

[1] Source: http://top500.org/

[Figure: GPU-based SoC shipments (in millions)]

SLIDE 3

PERFORMANCE: CPU VS. GPU (DP FLOPS)

Source: Karl Rupp’s website: www.karlrupp.net

SLIDE 4

RANGE OF APPLICATIONS FOR MODERN GPUS

Modern GPU Applications

Traditional Compute

  • Data parallel
  • Good scalability
  • Example applications: FFT, matrix multiply, NBody, convolution

Machine Learning

  • Data dependent kernel launch
  • Multiple situation-based kernels
  • Fragmented parallelism
  • E.g., outlier detection, DNN, Hidden Markov Models

Signal Processing

  • Stage based computation kernels
  • Kernel replication (channels)
  • Performance bound (video)
  • E.g., Filtering, H264 audio, JPEG compression, video processing

Irregular computations

  • Synchronization (Sorting)
  • Nested parallelism (Graph search)
  • E.g., graph traversal, GPU based sorting, list intersection

System Operations and Security

  • Varied kernel sizes
  • Non-deterministic launch order
  • Independent compute kernels with same data
  • E.g., encryption, garbage collection, file-system tasks

SLIDE 5

KEY FEATURES REQUIRED IN GPUS TO SUPPORT MODERN APPLICATIONS

  • Collaborative Execution
      ✓ Leverage multiple GPUs and the CPU concurrently to run a single problem
  • Applying Machine Learning approaches to tuning power/perf
      ✓ Machine learning algorithms are being used everywhere
  • Load Balancing
      ✓ Prevent starvation when executing multiple kernels together
  • Multiple Application Hosting
      ✓ Beneficial for cloud GPUs to mitigate user load and improve power
  • Time, Power and Performance QoS Objectives
      ✓ Maintain deadlines for low-latency applications

SLIDE 6

MACHINE LEARNING BASED SCHEDULING FRAMEWORK FOR GPUS

  • GPUs are already appearing as cloud-based instances:
      • Amazon Web Services
      • Nimbix
      • Peer1 Hosting
      • IBM Cloud
      • SkyScale
      • SoftLayer
  • Co-executing applications on GPUs may interfere with each other
  • Interference leads to individual slowdown and reduces system throughput
      ✓ Mystic: Framework for interference-aware scheduling of applications on GPU clusters/clouds using machine learning

SLIDE 7

INTERFERENCE ON MULTI-TASKED GPUS

  • Interference is any performance degradation caused by resource contention and conflicts between co-executing applications
  • 45 applications launched on an 8-GPU cluster using GPU remoting
      • Least Loaded (LL) and Round Robin (RR) schedulers used
      • 41% avg. slowdown over dedicated execution
      • 8 applications show 2.5x slowdown
      • Only 6 applications achieve QoS (80% of dedicated performance)
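
The slowdown and QoS figures above can be read as simple ratios of runtimes. The following is a minimal sketch of that arithmetic, assuming performance is inversely proportional to runtime; the function names and the sample runtimes are hypothetical and are not taken from the study.

```python
def slowdown(dedicated_s, shared_s):
    """Slowdown factor of a co-executed run relative to a dedicated run."""
    return shared_s / dedicated_s


def meets_qos(dedicated_s, shared_s, threshold=0.80):
    """True if the co-executed run keeps at least `threshold` of dedicated performance.

    Assumes performance is inversely proportional to runtime.
    """
    return dedicated_s / shared_s >= threshold


# Hypothetical runtimes in seconds: (dedicated, co-executed).
runs = {"app_a": (10.0, 11.0), "app_b": (10.0, 25.0)}

# Maps each application to (slowdown factor, QoS met?).
report = {name: (slowdown(d, s), meets_qos(d, s)) for name, (d, s) in runs.items()}
```

Here `app_b` runs 2.5x slower when sharing a GPU and misses the 80% QoS target, while `app_a` stays within it.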

SLIDE 8

MYSTIC: MACHINE LEARNING BASED INTERFERENCE DETECTION

  • Mystic is a layer implemented on the head node of a cluster
  • Stage-1
      • Initialize Mystic
      • Create status entry for incoming applications
      • Collect short, limited profiles for new apps
  • Stage-2
      • Predict missing metrics for short profiles using Collaborative Filtering with SVD and the training matrix
      • Fill PRT with predicted performance values
  • Stage-3
      • Detect similarity between current and existing applications using PRT and MAST
      • Generate interference scores
      • Select co-execution candidates
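
Stages 2 and 3 can be illustrated with a small, simplified sketch. It is not Mystic's implementation: it assumes a rank-1 model (only the leading singular vector of the training matrix, found by power iteration), fits the known part of a short profile to that vector to estimate the missing metrics, and then uses cosine similarity as a stand-in interference score, on the assumption that similar applications contend for the same resources. All function names and numbers are hypothetical.

```python
import math


def leading_right_singular_vector(matrix, iters=50):
    """Approximate the top right singular vector by power iteration on A^T A."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = A v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def predict_missing(training, short_profile):
    """Fill None entries by least-squares fitting the known entries to the leading vector."""
    v = leading_right_singular_vector(training)
    known = [j for j, x in enumerate(short_profile) if x is not None]
    coeff = sum(short_profile[j] * v[j] for j in known) / sum(v[j] ** 2 for j in known)
    return [x if x is not None else coeff * v[j] for j, x in enumerate(short_profile)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pick_gpu(new_profile, resident_profiles):
    """Choose the GPU whose resident application is least similar to the new one."""
    return min(resident_profiles,
               key=lambda gpu: cosine_similarity(new_profile, resident_profiles[gpu]))


# Hypothetical training matrix: rows are applications, columns are profiled metrics.
training = [[1.0, 2.0, 3.0, 4.0],
            [2.0, 4.0, 6.0, 8.0],
            [3.0, 6.0, 9.0, 12.0]]

# Short profile of an incoming application: the last two metrics were not collected.
full_profile = predict_missing(training, [4.0, 8.0, None, None])

# Hypothetical profiles of applications already running on each GPU.
resident = {"gpu0": [2.0, 4.0, 6.0, 8.0], "gpu1": [8.0, 1.0, 1.0, 0.5]}
target = pick_gpu(full_profile, resident)
```

A production system would likely keep more than one singular vector and combine several metrics into the interference score; the rank-1 version is used here only to keep the example short.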

SLIDE 9

EVALUATION METHODOLOGY

  • Mystic evaluated on a private cluster running rCUDA [UPV]
  • Node configuration
      • Xeon E5-2695 CPU
      • 2 Nvidia K40m GPUs
      • 24GB/16GB memory (head/compute)
      • IB FDR connect
      • 12GB GPU DRAM per device
  • 42 applications for training matrix (TRM) creation
  • 55 applications for testing Mystic
  • 100 random launch sequences

Suite | # Apps | Name of Applications
PolyBench-GPU | 14 | 2dConv, 3dConv, 3mm, Atax, Bicg, Gemm, Gesummv, Gramschmidt, mvt, Syr2k, Syrk, Correlation, Covariance, FDTD-2d
SHOC-GPU | 12 | BFS-Shoc, FFT, MD, MD5Hash, NeuralNet, Scan, Sort, Reduction, Spmv, Stencil-2D, Triad, qt-Clustering
Lonestar-GPU | 6 | BHNbody, BFS, LS, MST, DMR, SP, Sssp
NUPAR | 5 | CCL, LSS, HMM, LoKDR, IIR
OpenFOAM | 8 | Dsmc, PDR, thermo, buoyantS, mhd, simpleFoam, sprayFoam, driftFlux
ASC Proxy Apps | 4 | CoMD, LULESH, miniFE, miniMD
Caffe-cuDNN | 3 | leNet, logReg, fineTune
MAGMA | 3 | Dasaxpycp, strstm, dgesv

SLIDE 10

MYSTIC SCHEDULER PERFORMANCE

  • 90.2% of applications achieve QoS with Mystic
  • 75% of applications show less than 15% degradation with Mystic
  • RR achieves QoS for 21% of applications
  • LL experiences severe degradation
  • Short-running interfering apps are co-scheduled
  • Mystic scales effectively as we increase the number of nodes
  • LL utilizes GPUs more effectively than RR for fewer nodes

SLIDE 11

SUMMARIZING THE MYSTIC SCHEDULER

  • Mystic is an interference-aware scheduler for GPU clusters
  • Collaborative filtering is used to predict the performance of incoming applications
  • Mystic assigns applications for co-execution such that it minimizes interference
  • Evaluated Mystic on a 16-node cluster with 55 applications and 100 launch sequences

“Mystic: Predictive Scheduling for GPU based Cloud Servers Using Machine Learning,” U. Ukidave, X. Li and D. Kaeli, 30th IEEE International Parallel and Distributed Processing Symposium, May 2016.

SLIDE 12

AIRAVAT: IMPROVING THE ENERGY EFFICIENCY OF HETEROGENEOUS SYSTEMS/APPLICATIONS

  • New platforms can support concurrent CPU and GPU execution
  • Peak performance/power can be achieved when both devices are used collaboratively
  • Emerging heterogeneous platforms (APUs, Jetson, etc.) have the CPU and GPU share a single memory
  • Multiple/shared clock domains for CPU, GPU and memory
  • Airavat provides a framework that can improve the Energy Delay Product (EDP) on NVIDIA Jetson GPUs, providing significant power/performance benefits when compared to the NVIDIA baseline power manager

SLIDE 13

AIRAVAT: 2-LEVEL POWER MANAGEMENT FRAMEWORK

Frequency Approximation Layer

  • Uses Random Forest based prediction to select the best frequency tuple for the application

Collaborative Tuning Layer

  • Leverages run-time feedback based on performance counters to fine-tune CPU/GPU running collaboratively
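
The two levels can be pictured as a coarse prediction followed by feedback-driven refinement. The sketch below illustrates only the second step, and only under strong assumptions: the starting tuple stands in for the output of the prediction layer (no Random Forest is implemented here), the refinement is a plain greedy local search, and `measure` is a synthetic cost function rather than real performance counters. The names `fine_tune`, `levels` and `measure` are hypothetical.

```python
def fine_tune(start, levels, measure, max_steps=10):
    """Greedy local search over frequency tuples.

    Moves one frequency knob by one step at a time and keeps the move
    whenever the measured cost (for example EDP) decreases.
    """
    current = tuple(start)
    best = measure(current)
    for _ in range(max_steps):
        improved = False
        for knob in range(len(current)):
            idx = levels[knob].index(current[knob])
            for nxt in (idx - 1, idx + 1):
                if 0 <= nxt < len(levels[knob]):
                    candidate = list(current)
                    candidate[knob] = levels[knob][nxt]
                    candidate = tuple(candidate)
                    score = measure(candidate)
                    if score < best:
                        current, best, improved = candidate, score, True
                        break  # re-evaluate from the new operating point
        if not improved:
            break
    return current, best


# Hypothetical frequency steps in GHz: (CPU levels, GPU levels).
levels = [[0.8, 1.0, 1.2, 1.4], [0.4, 0.6, 0.8]]


def measure(freqs):
    """Synthetic cost with its minimum at (1.2, 0.6); stands in for run-time feedback."""
    cpu, gpu = freqs
    return (cpu - 1.2) ** 2 + (gpu - 0.6) ** 2 + 1.0


# (0.8, 0.8) plays the role of the tuple chosen by the prediction layer.
tuned, cost = fine_tune((0.8, 0.8), levels, measure)
```

On this synthetic cost the search settles on `(1.2, 0.6)`. On real hardware each call to `measure` would correspond to running the workload for an interval and reading counters, so the number of candidate evaluations matters.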

SLIDE 14

SUMMARY OF AIRAVAT

  • Average EDP reduction is 24% and can be as high as 70% when compared to the baseline power manager
  • The Frequency Approximation Layer (FAL) provides up to 19% energy savings, with an additional 5% from the Collaborative Tuning Layer (CTL)
  • Airavat achieves an application speedup of 1.24x
  • The EDP difference when compared to an oracle power manager is 7%
  • Airavat causes power loss of 4% since it trades energy and performance for higher power

*Airavat: Improving Energy Efficiency of Heterogeneous Applications – Accepted to DATE 2018
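
The Energy Delay Product is the product of consumed energy and execution time, so a configuration can lower both energy and runtime while still drawing more average power. A minimal worked example follows; the energy and runtime values are illustrative only and are not measurements from Airavat.

```python
def edp(energy_j, runtime_s):
    """Energy Delay Product in joule-seconds."""
    return energy_j * runtime_s


def average_power(energy_j, runtime_s):
    """Average power in watts."""
    return energy_j / runtime_s


# Illustrative values only: (energy in joules, runtime in seconds).
baseline = (100.0, 10.0)
tuned = (90.0, 8.5)

edp_reduction = 1.0 - edp(*tuned) / edp(*baseline)
power_increase = average_power(*tuned) / average_power(*baseline) - 1.0
speedup = baseline[1] / tuned[1]
```

With these numbers the EDP drops by 23.5% and the run is about 1.18x faster, yet average power rises by roughly 6%, because runtime shrinks faster than energy.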

SLIDE 15

OTHER AREAS OF ONGOING RESEARCH IN NUCAR

  • Multi2Sim simulator – Kepler CPU/GPU timing simulator – ISPASS’17, HSAsim ongoing
  • Multi-GPU and GPU-based NoC designs – HIPEAC’17
  • NICE – Northeastern Interactive Clustering Engine, NIPS’17
  • PCM reliability – DATE’17, DSN’17 – with Penn St.
  • GPU reliability – SELSE’16, SELSE’17, MWCSC’17, DATE’18
  • GPU Hardware Transactional Memory – EuroPar’17, IEEETC – with U. of Malaga
  • GPU compilation – CGO’17, ongoing with U. of Ghent
  • GPU scalar execution – ISPASS’13, IPDPS’14, IPDPS’16
  • GPU recursive execution on graphs – APG’17

SLIDE 16

THE NORTHEASTERN UNIVERSITY COMPUTER ARCHITECTURE RESEARCH LAB

SLIDE 17

QUESTIONS?

THANKS TO OUR SPONSORS...