CPU-GPU Heterogeneous Computing


SLIDE 1

December 2, 2015
  • Adv. Seminar CE // Steffen Lammel

CPU-GPU Heterogeneous Computing

Advanced Seminar "Computer Engineering"

Winter-Term 2015/16

Steffen Lammel, December 2, 2015

SLIDE 2

Content

  • Introduction
    – Motivation
    – Characteristics of CPUs and GPUs
  • Heterogeneous Computing Systems and Techniques
    – Workload division
    – Frameworks and tools
    – Fused HCS
  • Energy Saving with HCT
    – Intelligent workload division
    – Dynamic Voltage/Frequency Scaling (DVFS)
  • Conclusion
    – Programming aspects
    – Energy aspects

SLIDE 3

Introduction

SLIDE 4

Introduction

  • Grand goal in HPC
    – Exascale systems by the year ~2020
  • Problems
    – Computational power
      • Now: up to 7 GF/W
      • Exascale: >= 50 GF/W
    – Power budget: ~20 MW
    – Heat dissipation

Sources: [1], [2]
Compare: #1 TOP500: ~33 PF @ 1.9 GF/W

SLIDE 5

Introduction

  • CPU
    – Few cores (<= 20)
    – High frequency (~3 GHz)
    – Large caches, plenty of (slow) memory (<= 1 TB)
    – Latency oriented
  • GPU
    – Many cores (> 1000)
    – Low frequency (<= 1 GHz)
    – Fast memory, limited in size (<= 12 GB)
    – Throughput oriented

SLIDE 6

Introduction

  • Ways to increase Energy Efficiency:
    – Get the most computational power out of both domains
    – Utilize the sophisticated power-saving techniques modern CPUs/GPUs offer

Terminology:

HCS: Heterogeneous Computing System (hardware)
HCT: Heterogeneous Computing Technique (software)
PU: Processing Unit (can be both CPU and GPU)
FLOPS: Floating Point Operations per Second
  • DP: Double Precision
  • SP: Single Precision
BLAS: Basic Linear Algebra Subprograms
SIMD: Single Instruction, Multiple Data

SLIDE 7

Heterogeneous Computing Techniques (HCT)

Runtime Level

SLIDE 8

HCT – Basics

  • Worst case:
    – Only one PU is active at a time
  • Ideal case:
    – All PUs do (useful) work simultaneously

(Figure: CPU/GPU utilization timelines for both cases)

SLIDE 9

HCT – Basics

  • Examples are idealized
    – Real-world applications consist of several different patterns
  • Typical Processing Units (PU) in HCS
    – Tens of CPU cores/threads
    – Several 1000 GPU cores/kernels
  • Goals of HCT
    – All PUs have to be utilized (in a useful way)

SLIDE 10

HCT – Workload Division

  • Basic idea:
    – Divide the whole problem into smaller chunks
    – Assign each sub-task to a PU
    – compare: "PCAM" [5]
      • Partition
      • Communicate
      • Agglomerate
      • Map

(Figure: a problem being split into sub-tasks 1..n)

SLIDE 11

HCT – Workload Division
(naive)

Example:

  • Dual-core system
    – CPU + GPU
    – Naive data distribution
  • CPU core 0
    – Master/Arbiter
  • CPU core 1
    – Worker
  • GPU
    – Worker
  • Huge idle periods for GPU and CPU core 0

(Figure: timeline of core0, core1 and the GPU with large idle gaps)

SLIDE 12

HCT – Workload Division
(relative PU performance)

  • Approach: use the relative performance of each PU as metric
    – A microbenchmark or performance model deemed the GPU 3x faster than the CPU
    – Partition the work in a 3:1 ratio between the PUs
    – Task granularity and the quality/nature of the microbenchmark are the key factors here

(Figure: timeline of core0, core1 and the GPU)
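The 3:1 split above can be sketched in a few lines of Python; `split_by_speed` and the speed-up factor are illustrative assumptions, not code from the slides:

```python
def split_by_speed(work, gpu_speedup):
    """Partition `work` between GPU and CPU proportionally to the
    measured relative performance (gpu_speedup : 1)."""
    cut = round(len(work) * gpu_speedup / (gpu_speedup + 1))
    return work[:cut], work[cut:]  # (GPU share, CPU share)

# A microbenchmark deemed the GPU 3x faster -> 3:1 partition
gpu_part, cpu_part = split_by_speed(list(range(12)), 3)
# gpu_part gets 9 of the 12 chunks, cpu_part the remaining 3
```

The quality of the microbenchmark decides how close this static ratio comes to equal finishing times on both PUs.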

SLIDE 13

HCT – Workload Division
(characteristics of sub-tasks)

  • Idea:
    – Use the nature of the sub-tasks to leverage performance
    – CPU-affine tasks
    – GPU-affine tasks
    – Tasks which run roughly equally well on all PUs

(Figure: a problem being split into sub-tasks 1..n)

SLIDE 14

HCT – Workload Division
(nature of sub-tasks)

  • Map each task to the PU it performs best on
    – Latency: CPU
    – Throughput: GPU
  • Further scheduling metrics:
    – Capability (of the PU)
    – Locality (of the data)
    – Criticality (of the task)
    – Availability (of the PU)

(Figure: tasks mapped onto CPU0, CPU1 and the GPU)

SLIDE 15

HCT – Workload Division
(pipeline)

  • If overlap is possible: pipeline
    – Call kernels asynchronously to hide latency
    – Small penalty to fill and drain the pipeline
    – Good utilization of all PUs if the pipeline is full

(Figure: tasks A.1–C.3 staggered across PU0, PU1, …, PUn)
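The staggering of tasks A.1–C.3 across stages can be sketched with Python threads and queues; the three stages and their toy arithmetic are invented for illustration:

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """One pipeline stage: pull an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is None:        # poison pill: drain the pipeline
            outbox.put(None)
            return
        outbox.put(fn(item))

# Three stages stand in for PU0/PU1/PUn; once the pipeline is full,
# the *.1/*.2/*.3 steps of tasks A, B, C overlap in time.
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
workers = [threading.Thread(target=stage, args=(f, i, o))
           for f, i, o in ((lambda x: x + 1, q0, q1),
                           (lambda x: x * 2, q1, q2),
                           (lambda x: x - 3, q2, q3))]
for w in workers:
    w.start()
for task in (10, 20, 30, None):  # tasks A, B, C, then drain
    q0.put(task)
results = [q3.get() for _ in range(3)]
for w in workers:
    w.join()
# results == [19, 39, 59]
```

The `None` sentinel models the drain penalty: the last stages still run while the first ones already sit idle.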

SLIDE 16

HCT – Workload Division
(summary)

Summary: Metrics for workload division

  • Performance of PUs
  • Nature of sub-tasks
    – Order
      • Regular patterns --> GPU (little communication)
      • Irregular patterns --> CPU (lots of communication)
    – Memory footprint
      • Fits into VRAM? --> GPU
      • Too big? --> CPU
    – BLAS level
      • BLAS-1/2 --> CPU (vector-vector, vector-matrix operations)
      • BLAS-3 --> GPU (matrix-matrix operations)
  • Historical data
    – How well did each PU perform in the previous step?
  • Availability of PU
    – Is there a function/kernel for the desired PU?
    – Is the PU able to take a task (scheduling-wise)?
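The metrics above can be condensed into a toy dispatch heuristic. Every field name and threshold below is an assumption made for illustration; this is a sketch, not a real scheduler:

```python
def choose_pu(task, vram_bytes=12 * 2**30):
    """Pick a PU for `task` using the workload-division metrics:
    memory footprint, communication pattern and BLAS level."""
    if task["memory_bytes"] > vram_bytes:
        return "CPU"              # does not fit into VRAM
    if task["pattern"] == "irregular":
        return "CPU"              # lots of communication
    if task["blas_level"] in (1, 2):
        return "CPU"              # vector-vector / vector-matrix
    return "GPU"                  # regular, matrix-matrix-like work

# A BLAS-3 task with a regular pattern that fits into VRAM -> GPU
task = {"memory_bytes": 2**28, "pattern": "regular", "blas_level": 3}
# choose_pu(task) == "GPU"
```

A real scheduler would also weigh historical performance and current PU availability, which this sketch omits.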

SLIDE 17

Heterogeneous Computing Techniques (HCT)

Frameworks and tools

SLIDE 18

HCT – Framework Support

  • Implementing these techniques by hand is tedious and error-prone
    – Better: let a framework do this job!
  • Frameworks for load balancing
    – Compile-time level (static scheduling)
    – Runtime level (dynamic scheduling)
  • Frameworks for parallel abstraction
    – Write the algorithm as a sequential program and let the tools figure out how to utilize the PUs optimally
    – Source-code annotations give the runtime/compiler hints about the best approach (compare: OpenMP #pragma omp xxx)
    – Scheduling: dynamic or static
  • The partitioning and work-division principles shown before apply here as well!
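The difference between the two scheduling levels can be illustrated with a small event-driven simulation of the dynamic case: whenever a PU becomes free it pulls the next chunk, so a faster PU automatically takes a larger share. PU names and speeds are assumed numbers, not measurements:

```python
import heapq

def dynamic_schedule(task_costs, pu_speeds):
    """Simulate runtime-level (dynamic) scheduling: each task goes to
    the PU that becomes free earliest; `pu_speeds` maps a PU name to
    work units processed per time unit."""
    free_at = [(0.0, name) for name in pu_speeds]  # (time PU is free, PU)
    heapq.heapify(free_at)
    assignment = {name: [] for name in pu_speeds}
    for cost in task_costs:
        t, name = heapq.heappop(free_at)           # earliest-free PU
        assignment[name].append(cost)
        heapq.heappush(free_at, (t + cost / pu_speeds[name], name))
    return assignment

# GPU 3x as fast as the CPU: it ends up with ~3/4 of the chunks
plan = dynamic_schedule([3] * 8, {"GPU": 3.0, "CPU": 1.0})
# len(plan["GPU"]) == 6, len(plan["CPU"]) == 2
```

A static (compile-time) scheduler would fix this split up front from a performance model, as in the 3:1 example earlier; the dynamic version adapts even when per-task costs vary.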

SLIDE 19

HCT – Framework Support

  • Generic PU-specific tools and frameworks
    – CUDA + libraries (Nvidia GPU)
    – OpenMP, Pthreads (CPU)
  • Generic heterogeneity-aware frameworks
    – OpenCL, OpenACC
    – OpenMP ("offloading", since v4.0)
    – CUDA (CPU callback)
  • Custom frameworks (interesting examples)
    – Compile-time level (static scheduling approach)
      • SnuCL
    – Runtime level (dynamic scheduling approach)
      • PLASMA
SLIDE 20

HCT – Framework Support
(Example: SnuCL)

  • Creates a "virtual node" with all the PUs of a cluster
  • Uses a message passing interface (MPI) to distribute the workloads to the distant PUs
  • Inter-node communication is implicit

Source: [4]

SLIDE 21

HCT – Framework Support
(Example: PLASMA)

  • Intermediate representation (IR)
    – Independent of the PU
  • PU-specific implementation based on the IR
    – Utilizes the PU's specific SIMD capabilities
  • A runtime decides the assignment to a PU dynamically
    – Speed-up depends on the workload

(Figure: original code is translated to IR code, then to CPU and GPU code, which the runtime dispatches to the CPU or GPU)

SLIDE 22

Heterogeneous Computing Techniques (HCT)

Fused HCS

SLIDE 23

HCT – Fused HCS

  • CPU and GPU share the same die
    – … and the same address space!
  • Communication paths are significantly shorter
  • AMD "Fusion" APU
    – x86 + OpenCL
  • Intel "Sandy Bridge" and successors
    – x86 + OpenCL
  • Nvidia "Tegra"
    – ARM + CUDA

SLIDE 24

Energy saving with HCS

SLIDE 25

Energy saving with HCS

  • Trade-off
    – Performance vs. energy consumption
  • Modern PUs ship with extensive power-saving features
    – e.g. power regions, clock gating
  • Less aggressive energy saving in HPC
    – Reason: avoid state-transition penalties
  • Aggressive energy saving in mobile/embedded
    – Battery life is everything in this domain

SLIDE 26

Energy saving with HCS
(hardware: DVFS)

  • Dynamic Voltage/Frequency Scaling (DVFS)
    – P = C·V²·f
      • with f ~ V; C = const. (so P ~ f³)
    – Reduce f by 20%
      • → P: ~ -50% (0.8³ ≈ 0.51)
    – How far can we lower f/V and still meet our timing constraints?
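The -50% figure can be checked directly from the formula, assuming voltage scales linearly with frequency (so P ~ f³); the constants are arbitrary:

```python
def dynamic_power(f, c=1.0, k=1.0):
    """P = C * V^2 * f with V = k * f, i.e. P is proportional to f^3."""
    v = k * f
    return c * v**2 * f

p_full = dynamic_power(1.0)
p_scaled = dynamic_power(0.8)      # frequency reduced by 20%
saving = 1.0 - p_scaled / p_full   # ~0.49: roughly half the power
```

The cubic dependence is why even modest frequency reductions pay off, as long as the longer runtime does not eat the savings.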

SLIDE 27

Energy saving with HCS
(software: intelligent work distribution)

  • Intelligent workload partitioning
    – Power model of tasks and PUs
    – Assign tasks with respect to the power model
    – Take communication overhead into account
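A minimal sketch of power-model-based assignment: per task, estimate energy on each PU as power × runtime plus transfer overhead and pick the minimum. All wattages, speeds and transfer costs below are invented for illustration:

```python
def assign_energy_aware(task_cost, pu_power, pu_speed, comm_joules):
    """Return the PU with the lowest estimated energy for one task:
    E(pu) = P(pu) * (cost / speed(pu)) + communication overhead."""
    return min(pu_power, key=lambda pu:
               pu_power[pu] * task_cost / pu_speed[pu] + comm_joules[pu])

pu_power = {"CPU": 50.0, "GPU": 150.0}   # watts (assumed)
pu_speed = {"CPU": 1.0, "GPU": 4.0}      # relative throughput (assumed)
comm = {"CPU": 0.0, "GPU": 5.0}          # PCIe transfer energy in J (assumed)

small = assign_energy_aware(0.1, pu_power, pu_speed, comm)   # "CPU"
big = assign_energy_aware(10.0, pu_power, pu_speed, comm)    # "GPU"
# the GPU's speed only amortizes the transfer for large enough tasks
```

With these numbers the GPU wins only once the task is big enough that its shorter runtime outweighs the fixed transfer energy.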

SLIDE 28

Conclusion

SLIDE 29

Conclusion
(programming aspects)

  • Trade-offs
    – Usability vs. performance
    – Portability vs. performance
  • High performance requires high programming effort
    – "Raw" CUDA/OpenMP
  • Code-abstracting frameworks can support developers to a certain degree
    – OpenCL, OpenACC, custom solutions
  • Dynamic scheduling frameworks can accelerate an application
    – Complicated cases, many PUs

(Figure: trade-off triangle between performance, ease of programming and portability)

SLIDE 30

Conclusion
(energy aspects)

  • HCS do contribute to a better performance/watt ratio
  • Intelligent workload partitioning
    – Equally important for performance and energy saving
  • DVFS is a key technique for energy saving
  • Fused HCS can fill niches where communication is a key factor

SLIDE 31

Thank you for your attention! Questions?

SLIDE 32

References

Papers:
  • A Survey of CPU-GPU Heterogeneous Computing Techniques
  • SnuCL: an OpenCL Framework for Heterogeneous CPU/GPU Clusters
  • PLASMA: Portable Programming for SIMD Heterogeneous Accelerators

Images:
[1] http://www.hpcwire.com/2015/08/04/japan-takes-top-three-spots-on-green500-list/
[2] http://www.top500.org/lists/2015/11/
[3] https://en.wikipedia.org/wiki/Performance_per_watt#FLOPS_per_watt
[4] http://snucl.snu.ac.kr/features.html
[5] http://www.mcs.anl.gov/~itf/dbpp/text/node15.html