CPU-GPU Heterogeneous Computing
Advanced Seminar "Computer Engineering", Winter Term 2015/16
Steffen Lammel
December 2, 2015

Content
– Introduction
  – Motivation
  – Characteristics of CPUs
– Workload division
– Frameworks and tools
– Fused HCS
– Energy Saving with HCT
  – Intelligent workload
  – Dynamic
– Programming aspects
– Energy aspects
Source: [1] Source: [2]
Compare: #1 on the TOP500: ~33 PFLOPS at 1.9 GFLOPS/W
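The TOP500 comparison figure implies a power draw, which a short calculation makes explicit. This is a minimal sketch, assuming "PF" means PFLOPS and "GF/W" means GFLOPS per watt; the function name is illustrative and not taken from the slides.

```python
def implied_power_megawatts(performance_pflops: float,
                            efficiency_gflops_per_watt: float) -> float:
    """Power in MW implied by a performance and an energy-efficiency figure."""
    flops = performance_pflops * 1e15                   # PFLOPS -> FLOPS
    flops_per_watt = efficiency_gflops_per_watt * 1e9   # GFLOPS/W -> FLOPS/W
    watts = flops / flops_per_watt
    return watts / 1e6                                  # W -> MW


# ~33 PFLOPS at 1.9 GFLOPS/W corresponds to roughly 17.4 MW.
print(round(implied_power_megawatts(33, 1.9), 1))
```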
CPU:
– Few cores (<= 20)
– High frequency (~3 GHz)
– Large caches, plenty of
– Latency oriented
GPU:
– Many cores (> 1000)
– Low frequency (<= 1 GHz)
– Fast memory, limited in size
– Throughput oriented
Terminology:
– HCS: Heterogeneous Computing System (hardware)
– HCT: Heterogeneous Computing Technique (software)
– PU: Processing Unit (either a CPU or a GPU)
– FLOPS: Floating Point Operations per Second
– BLAS: Basic Linear Algebra Subprograms
– SIMD: Single Instruction, Multiple Data
– Divide the whole problem
– Assign each sub-task to a
– compare: “PCAM”, [5]
[Figure: the problem split into Sub-Task 1 ... Sub-Task n]
(naive)
– CPU + GPU
– Naive data distribution
– Master/Arbiter
– Worker
– Worker
[Figure: core0, GPU, core1]
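The naive scheme can be sketched as a master that cuts the data into equal chunks and hands one chunk to each worker, regardless of how fast that worker is. This is a minimal sketch, assuming core0 acts as master/arbiter and the GPU and core1 act as workers; the worker is a plain Python function standing in for a real kernel, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_workers):
    """Naive distribution: equal-sized contiguous chunks, ignoring PU speed."""
    base, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def worker(chunk):
    # Stand-in for a kernel; a real GPU worker would launch device code here.
    return sum(x * x for x in chunk)


def master(items, worker_names=("GPU", "core1")):
    """Master/arbiter (assumed to run on core0): distribute, then combine."""
    chunks = split_evenly(items, len(worker_names))
    with ThreadPoolExecutor(max_workers=len(worker_names)) as pool:
        partial_sums = list(pool.map(worker, chunks))
    return sum(partial_sums)


print(master(list(range(10))))  # sum of squares of 0..9
```

With an even split, the slower worker determines the total run time, which is the weakness the later strategies address.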
(relative PU performance)
– A microbenchmark or
– Partition the work in a 3:1
– T
[Figure: core0, GPU, core1]
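A ratio-based split can be sketched as follows: measure each PU's throughput once, then divide the work proportionally. This is a minimal sketch; the slide does not say which PU receives the larger share, so the example assumes a GPU that measured three times faster than the CPU, and the function name and throughput numbers are hypothetical.

```python
def partition_by_throughput(n_items, throughputs):
    """Split n_items proportionally to each PU's measured throughput."""
    total = sum(throughputs.values())
    shares = {pu: int(n_items * t / total) for pu, t in throughputs.items()}
    # Items lost to integer truncation go to the fastest PU.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += n_items - sum(shares.values())
    return shares


# Assumed microbenchmark result: GPU 3 units/s, CPU 1 unit/s, i.e. a 3:1 split.
print(partition_by_throughput(1000, {"GPU": 3.0, "CPU": 1.0}))
```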
(characteristics of sub-tasks)
[Figure: Sub-Task 1 ... Sub-Task n]
(nature of sub-tasks)
– Latency: CPU
– Throughput: GPU
– Capability (of the PU)
– Locality (of the data)
– Criticality (of the task)
– Availability (of the PU)
[Figure: CPU1, CPU0, GPU]
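One way to combine these criteria is a small decision function. This is only a sketch under stated assumptions: the slide lists the criteria but not how to weigh them, so the priority order used here (capability, then criticality, then availability, then nature, then locality) and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    name: str
    kind: str               # "latency" or "throughput" (nature of the task)
    data_on: str = "CPU"    # locality: where the input data currently lives
    critical: bool = False  # criticality of the task


def choose_pu(task, capable, idle):
    """Pick "CPU" or "GPU" for a sub-task.

    capable: set of PUs that have a kernel for this task.
    idle:    set of PUs that are currently free.
    """
    preferred = "CPU" if task.kind == "latency" else "GPU"
    # Capability: only PUs with a matching kernel are candidates.
    options = [pu for pu in ("CPU", "GPU") if pu in capable]
    if not options:
        raise ValueError(f"no PU can run {task.name}")
    # Criticality: a critical task waits for its preferred PU if necessary.
    if task.critical and preferred in options:
        return preferred
    # Availability: prefer idle PUs; fall back to all capable ones.
    available = [pu for pu in options if pu in idle] or options
    # Nature first, then locality of the data.
    for pick in (preferred, task.data_on):
        if pick in available:
            return pick
    return available[0]


print(choose_pu(SubTask("fft", "throughput"), {"CPU", "GPU"}, {"CPU", "GPU"}))
```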
(pipeline)
PU0        PU1        PUn
Task A.1   Task A.2   Task A.3
Task B.1   Task B.2   Task B.3
Task C.1   Task C.2   Task C.3
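The pipeline layout can be sketched with one worker per stage, connected by queues, so that each PU always executes the same stage while tasks A, B and C flow through. This is a minimal sketch, assuming threads stand in for the PUs; the function and variable names are hypothetical.

```python
import queue
import threading


def run_pipeline(tasks, stages):
    """Run every task through all stages; each stage has its own worker."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    stop = object()  # sentinel that shuts the pipeline down stage by stage

    def stage_worker(i):
        while True:
            item = queues[i].get()
            if item is stop:
                queues[i + 1].put(stop)
                return
            queues[i + 1].put(stages[i](item))

    threads = [threading.Thread(target=stage_worker, args=(i,))
               for i in range(len(stages))]
    for t in threads:
        t.start()
    for task in tasks:
        queues[0].put(task)
    queues[0].put(stop)

    results = []
    while True:
        item = queues[-1].get()
        if item is stop:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Three stages (".1", ".2", ".3") applied to three tasks (A, B, C).
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline([1, 2, 3], stages))
```

Because each stage has a single worker and the queues are FIFO, results come out in the order the tasks went in.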
(relative PU performance)
– Order
communication)
communication)
– Memory Footprint
– BLAS-Level
Matrix operations)
– How well did each PU perform in the previous step?
– Is there a function/kernel for the desired PU?
– Is the PU able to take a task (scheduling-wise)?
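The question of how well each PU performed in the previous step can drive a simple feedback loop: derive each PU's observed rate from the last step and size the next shares accordingly. This is a minimal sketch of that idea, not a scheme taken from the slides; the function name and the timing numbers are hypothetical.

```python
def rebalance(shares, elapsed):
    """Compute the next step's shares from the previous step's timings.

    shares:  items each PU processed in the previous step.
    elapsed: seconds each PU needed for its share.
    """
    total = sum(shares.values())
    rates = {pu: shares[pu] / elapsed[pu] for pu in shares}  # items per second
    rate_sum = sum(rates.values())
    new_shares = {pu: round(total * r / rate_sum) for pu, r in rates.items()}
    # Correct any rounding drift on the fastest PU so the total is preserved.
    fastest = max(rates, key=rates.get)
    new_shares[fastest] += total - sum(new_shares.values())
    return new_shares


# Previous step: an even split, but the CPU took three times as long.
print(rebalance({"CPU": 500, "GPU": 500}, {"CPU": 3.0, "GPU": 1.0}))
```

Repeating this after every step lets the split follow changes in load without a separate benchmark run.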
– Better: Let a framework do this job!
– Compile-Time Level (static scheduling)
– Runtime Level (dynamic scheduling)
– Write the algorithm as a sequential program and let the tools figure out how to utilize the PUs optimally
– Source code annotations give the runtime/compiler hints about which approach is best (compare: OpenMP #pragma omp xxx)
– Scheduling: dynamic or static
well!
– CUDA + libraries (Nvidia GPU)
– OpenMP, Pthreads (CPU)
– OpenCL, OpenACC
– OpenMP (“offloading”, since v4.0)
– CUDA (CPU-callback)
– Compile-Time Level (static scheduling approach)
– Run-Time Level (dynamic scheduling approach)
(Example: SnuCL)
Source: [4]
(Example: PLASMA)
– Independent of PU
– Utilizes the PU's specific SIMD capabilities
– Speed-Up depends on workload
[Figure: Original Code, IR Code, Runtime, CPU Code, GPU Code, CPU, GPU]
– … and the same address space!
– x86 + OpenCL
– x86 + OpenCL
– ARM + CUDA
(hardware: DVFS)
(software: intelligent work distribution)
(programming aspects)
– Usability vs. performance
– Portability vs. performance
programming effort
– “Raw” CUDA/OpenMP
support developers to a certain degree
– OpenCL, OpenACC, custom solutions
accelerate an application
– Complicated cases, many PUs
[Figure: Performance, Ease of programming, Portability]
(energy aspects)
Papers:
– A Survey of CPU-GPU Heterogeneous Computing Techniques
– SnuCL: an OpenCL Framework for Heterogeneous CPU/GPU Clusters
– PLASMA: Portable Programming for SIMD Heterogeneous Accelerators
Images:
[1] http://www.hpcwire.com/2015/08/04/japan-takes-top-three-spots-on-green500-list/
[2] http://www.top500.org/lists/2015/11/
[3] https://en.wikipedia.org/wiki/Performance_per_watt#FLOPS_per_watt
[4] http://snucl.snu.ac.kr/features.html
[5] http://www.mcs.anl.gov/~itf/dbpp/text/node15.html