Advanced Topics on Heterogeneous System Architectures
HSA Foundation
Politecnico di Milano, Seminar Room (Bld 20), 15 December 2017
Antonio R. Miele, Marco D. Santambrogio, Politecnico di Milano
HSA is a tremendous advance over previous platforms: all processors gain uniform access to memory.
– Easier to program
– Easier to optimize
– Easier to load balance
– Higher performance
– Lower power
The architecture extends beyond the GPU to other accelerators as well.
Limitations of the legacy discrete-GPU model:
– No pointer-based data structures can be shared with the GPU
– High latency and low bandwidth on transfers over the bus
– Transfers must be large enough to amortize the copy overhead
APU = Accelerated Processing Unit (i.e. a SoC that also contains a GPU). Even APUs have the same issues when CPU and GPU memories are partitioned:
1. The CPU explicitly copies data to GPU memory
2. The GPU executes the computation
3. The CPU explicitly copies the results back to its own memory
HSA provides a unified virtual address space across all processors:
– Enables the use of pointers across devices
– No explicit data transfers: values move on demand
– Pageable virtual addresses for GPUs: no GPU memory-capacity constraints
The same data can thus be accessed by different processors at different times.
1. The CPU simply passes a pointer to the GPU
2. The GPU executes the computation
3. The CPU can read the results directly: no explicit copy needed!
Transmission of input data
Transmission of results
The runtime distributes dispatched tasks to the hardware queues.
– Application code dispatches directly to the hardware
– User-mode queuing
– Hardware scheduling
– Low dispatch times
As a result:
– No user-mode drivers
– No kernel-mode transitions
– No overhead!
AQL (Architected Queuing Layer) enables any agent to enqueue tasks:
– Single compute dispatch path for all hardware
– No driver translation, direct access to hardware
– Standard across vendors
– Self-enqueuing is also allowed
– Single compute dispatch path for all hardware
– No driver translation, direct to hardware
– Any agent can enqueue work: CPU or GPU
– GPU self-enqueuing enables patterns such as:
– Recursion
– Tree traversal
– Wavefront reforming
– Like Java bytecodes, but for GPUs
– Most optimizations (including register allocation) are performed before HSAIL is generated
– Application binaries may ship with embedded HSAIL
– The finalizer may execute at run time, install time, or build time
– Designed for data-parallel programming
– GPU code can directly call system services, I/O, printf, etc.
– Fits naturally into the OpenCL compilation stack
– Also targeted from higher-level languages: Java, C++, OpenMP, Python, etc.
HSAIL
– Programming models close to today's CPU programming models
– Enables more advanced language features on the GPU
– Shared virtual memory supports complex pointer-containing data structures (lists, trees, etc.), and hence more applications, on the GPU
– A kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)
– Host and device code side by side in the same source file, written in the same programming language
– Pointers can be freely shared between host and device
– Memory model similar to that of a multi-core CPU
– Typically the same syntax as used for multi-core CPU programming