Slide 1: Advanced Topics on Heterogeneous System Architectures - HSA Foundation

Politecnico di Milano, Seminar Room (Bld 20), 15 December 2017
Antonio R. Miele, Marco D. Santambrogio, Politecnico di Milano

Slide 2: References

• This presentation is based on the material and slides published on the HSA Foundation website:
  – http://www.hsafoundation.com/

Slide 3: Heterogeneous processors have proliferated – make them better

• Heterogeneous SoCs have arrived and are a tremendous advance over previous platforms
• SoCs combine CPU cores, GPU cores and other accelerators, with high-bandwidth access to memory
• How do we make them even better?
  – Easier to program
  – Easier to optimize
  – Easier to load balance
  – Higher performance
  – Lower power
• HSA unites accelerators architecturally
• Early focus is on the GPU compute accelerator, but HSA will go well beyond the GPU

Slide 4: HSA Foundation

• Founded in June 2012
• Developing a new platform for heterogeneous systems
• www.hsafoundation.com
• Specifications under development in working groups to define the platform
• Membership consists of 43 companies and 16 universities
• Adding 1-2 new members each month

Slide 5: HSA consortium

Slide 6: HSA goals

• To enable power-efficient performance
• To improve programmability of heterogeneous processors
• To increase the portability of code across processors and platforms
• To increase the pervasiveness of heterogeneous solutions throughout the industry

Slide 7: Paradigm shift

• Inflection in processor design and programming
Slide 8: Key features of HSA

• hUMA – Heterogeneous Unified Memory Architecture
• hQ – Heterogeneous Queuing
• HSAIL – HSA Intermediate Language

Slide 9: Key features of HSA

• hUMA – Heterogeneous Unified Memory Architecture
• hQ – Heterogeneous Queuing
• HSAIL – HSA Intermediate Language

Slide 10: Legacy GPU compute

• Multiple memory pools
• Multiple address spaces
  – No pointer-based data structures
• Explicit data copying across PCIe
  – High latency
  – Low bandwidth
• High-overhead dispatch
• Need lots of compute on the GPU to amortize copy overhead
• Very limited GPU memory capacity
• Dual-source development
• Proprietary environments
• Expert programmers only

Slide 11: Existing APUs and SoCs

• Physical integration of GPUs and CPUs
• Data copies on an internal bus
• Two memory pools remain
• Still queue through the OS
• Still requires expert programmers
• FPGAs and DSPs have the same issues

APU = Accelerated Processing Unit (i.e. an SoC that also contains a GPU)

Slide 12: Existing APUs and SoCs

• CPU and GPU still have separate memories for the programmer (different virtual memory spaces):
  1. The CPU explicitly copies data to GPU memory
  2. The GPU executes the computation
  3. The CPU explicitly copies the results back to its own memory

A minimal sketch of this legacy flow is shown below.
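As a concrete illustration, here is a minimal OpenCL 1.x-style host sketch of the copy-in/compute/copy-out pattern above (the kernel name "scale" and all sizes are illustrative; error checking is omitted):

```c
/* Legacy flow: data is explicitly staged between CPU and GPU memory. */
#include <CL/cl.h>

void run_legacy(cl_context ctx, cl_command_queue q, cl_kernel scale,
                float *host_in, float *host_out, size_t n)
{
    /* 1. CPU explicitly copies input data into a GPU-side buffer */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                         host_in, 0, NULL, NULL);

    /* 2. GPU executes the computation */
    clSetKernelArg(scale, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, scale, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 3. CPU explicitly copies the results back to its own memory */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                        host_out, 0, NULL, NULL);
    clReleaseMemObject(buf);
}
```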

Slide 13: An HSA-enabled SoC

• Unified coherent memory enables data sharing across all processors
  – Enables the usage of pointers
  – No explicit data transfer -> values move on demand
  – Pageable virtual addresses for GPUs -> no GPU capacity constraints
• Processors architected to operate cooperatively
• Designed to enable the application to run on different processors at different times

Slides 14-15: Unified coherent memory

• CPU and GPU share a unified virtual memory space:
  1. The CPU simply passes a pointer to the GPU
  2. The GPU executes the computation
  3. The CPU can read the results directly – no explicit copy needed!

A minimal SVM sketch of this flow follows.
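For comparison with the legacy sketch on slide 12, the same flow can be expressed with a single shared pointer. This is a minimal sketch using OpenCL 2.0 coarse-grained shared virtual memory (kernel name "scale" is illustrative; error checking omitted):

```c
/* HSA-style flow: host and device dereference the same pointer. */
#include <CL/cl.h>

void run_svm(cl_context ctx, cl_command_queue q, cl_kernel scale, size_t n)
{
    /* One allocation, visible to both CPU and GPU */
    float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                      n * sizeof(float), 0);

    /* CPU writes the input in place (map/unmap for coarse-grained SVM) */
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float),
                    0, NULL, NULL);
    for (size_t i = 0; i < n; i++) data[i] = (float)i;
    clEnqueueSVMUnmap(q, data, 0, NULL, NULL);

    /* 1.-2. CPU simply passes the pointer; GPU executes the computation */
    clSetKernelArgSVMPointer(scale, 0, data);
    clEnqueueNDRangeKernel(q, scale, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* 3. CPU reads the results directly - no explicit copy */
    clEnqueueSVMMap(q, CL_TRUE, CL_MAP_READ, data, n * sizeof(float),
                    0, NULL, NULL);
    /* ... consume data ... */
    clEnqueueSVMUnmap(q, data, 0, NULL, NULL);
    clSVMFree(ctx, data);
}
```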

Slides 16-28: Unified coherent memory
(A sequence of frames animating the same diagram: transmission of the input data through the unified memory, computation on the GPU, then transmission of the results.)

Slide 29: Unified coherent memory

• OpenCL 2.0 leverages the HSA memory organization to implement its shared virtual memory (SVM) model
• SVM can be used to share pointers in the same context among devices and the host

Slide 30: Key features of HSA

• hUMA – Heterogeneous Unified Memory Architecture
• hQ – Heterogeneous Queuing
• HSAIL – HSA Intermediate Language

Slide 31: hQ: heterogeneous queuing

• Task-queuing runtimes
  – A popular pattern for task- and data-parallel programming on Symmetric Multiprocessor (SMP) systems
  – Characterized by:
    • A work queue per core
    • A runtime library that divides large loops into tasks and distributes them to the queues
    • A work-stealing scheduler that keeps the system balanced
• HSA is designed to extend this pattern to run on heterogeneous systems (a minimal SMP sketch of the pattern follows)
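To make the pattern concrete, here is a minimal C sketch of such a runtime on an SMP. All names are illustrative, not an HSA API; production runtimes use per-core deques with a Chase-Lev-style stealing protocol rather than this simplified shared-counter scheme:

```c
/* A large loop is split into chunk tasks spread over per-worker queues;
 * a worker drains its own queue first, then "steals" from the others. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define WORKERS 4
#define TASKS   64
#define N       (1 << 20)

typedef struct { int begin, end; } task_t;   /* one chunk of the loop */

static task_t queues[WORKERS][TASKS];        /* a work queue per core */
static atomic_int next_task[WORKERS];        /* per-queue drain index */
static int count[WORKERS];
static float data[N];

static void run(task_t t) {
    for (int i = t.begin; i < t.end; i++) data[i] *= 2.0f;
}

static void *worker(void *arg) {
    int self = (int)(long)arg;
    for (int v = 0; v < WORKERS; v++) {      /* own queue first, then steal */
        int q = (self + v) % WORKERS, i;
        while ((i = atomic_fetch_add(&next_task[q], 1)) < count[q])
            run(queues[q][i]);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0f;
    /* the runtime divides the large loop into tasks and distributes them */
    int chunk = N / TASKS;
    for (int t = 0; t < TASKS; t++) {
        int w = t % WORKERS;
        queues[w][count[w]++] = (task_t){ t * chunk, (t + 1) * chunk };
    }
    pthread_t th[WORKERS];
    for (long w = 0; w < WORKERS; w++)
        pthread_create(&th[w], NULL, worker, (void *)w);
    for (int w = 0; w < WORKERS; w++)
        pthread_join(th[w], NULL);
    printf("data[0] = %.1f\n", data[0]);     /* prints 2.0 */
    return 0;
}
```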

Slide 32: hQ: heterogeneous queuing

• How compute dispatch operates today in the driver model

Slide 33: hQ: heterogeneous queuing

• How compute dispatch improves under HSA:
  – Application codes directly to the hardware
  – User-mode queuing
  – Hardware scheduling
  – Low dispatch times
• As a consequence:
  – No soft queues
  – No user-mode drivers
  – No kernel-mode transitions
  – No overhead!

Slide 34: hQ: heterogeneous queuing

• AQL (Architected Queuing Layer) enables any agent to enqueue tasks

Slide 35: hQ: heterogeneous queuing

• AQL (Architected Queuing Layer) enables any agent to enqueue tasks
  – A single compute dispatch path for all hardware
  – No driver translation, direct access to hardware
  – Standard across vendors
• All agents can enqueue
  – Self-enqueuing is also allowed
• Requires coherency and shared virtual memory (a dispatch sketch follows)
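Below is a minimal sketch of a user-mode AQL dispatch with the HSA runtime C API. It assumes a finalized kernel_object and a kernarg buffer already exist, and it omits all error handling; the queue size and work-group shape are illustrative:

```c
/* Enqueue one kernel dispatch packet and ring the user-mode doorbell:
 * no driver call appears anywhere in the dispatch path. */
#include <hsa.h>
#include <string.h>

void dispatch(hsa_agent_t gpu, uint64_t kernel_object, void *kernarg,
              uint32_t grid_size)
{
    hsa_queue_t *q;
    hsa_queue_create(gpu, 4096, HSA_QUEUE_TYPE_MULTI, NULL, NULL,
                     UINT32_MAX, UINT32_MAX, &q);

    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);

    /* Reserve a packet slot: just an atomic bump of the write index */
    uint64_t idx = hsa_queue_add_write_index_relaxed(q, 1);
    hsa_kernel_dispatch_packet_t *pkt =
        (hsa_kernel_dispatch_packet_t *)q->base_address + (idx & (q->size - 1));

    memset(pkt, 0, sizeof(*pkt));
    pkt->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
    pkt->workgroup_size_x = 256;
    pkt->workgroup_size_y = 1;
    pkt->workgroup_size_z = 1;
    pkt->grid_size_x = grid_size;
    pkt->grid_size_y = 1;
    pkt->grid_size_z = 1;
    pkt->kernel_object = kernel_object;
    pkt->kernarg_address = kernarg;
    pkt->completion_signal = done;

    /* Publish the packet type last, then ring the doorbell */
    __atomic_store_n(&pkt->header,
        (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE) |
        (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE),
        __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(q->doorbell_signal, idx);

    /* Wait for completion on the signal */
    hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}
```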
Slide 36: hQ: heterogeneous queuing

• A work-stealing scheduler keeps the system balanced

Slide 37: Advantages of the queuing model

• Today's picture:
Slide 38: Advantages of the queuing model

• The unified shared memory allows pointers to be shared among different processing elements, thus avoiding explicit memory transfer requests

Slide 39: Advantages of the queuing model

• Coherent caches remove the need for explicit synchronization operations
Slide 40: Advantages of the queuing model

• The supported signaling mechanism enables asynchronous events between agents without involving the OS kernel (a sketch follows)
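Here is a minimal sketch of this signaling mechanism with HSA runtime signals. A second host thread stands in for a producing agent; the signal name and the use of pthreads are illustrative:

```c
/* Agent-to-agent signaling: the waiter is woken through an HSA signal,
 * with no OS-kernel involvement in the signaling path. */
#include <hsa.h>
#include <pthread.h>
#include <stdio.h>

static hsa_signal_t ready;

static void *producer(void *arg)
{
    /* ... produce data in shared memory ... */
    hsa_signal_store_release(ready, 0);      /* wake any waiters */
    return NULL;
}

int main(void)
{
    hsa_init();
    hsa_signal_create(1, 0, NULL, &ready);   /* initial value 1 */

    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* Block until the signal value drops below 1 */
    hsa_signal_wait_acquire(ready, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    printf("event received\n");

    pthread_join(t, NULL);
    hsa_signal_destroy(ready);
    hsa_shut_down();
    return 0;
}
```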

Slide 41: Advantages of the queuing model

• Tasks are directly enqueued by the applications without using OS mechanisms

Slide 42: Advantages of the queuing model

• HSA picture:
Slide 43: Device-side queuing

• Let's consider a tree-traversal problem:
  – Every node in the tree is a job to be executed
  – We may not know the size of the tree a priori
  – The input parameters of a job may depend on the parent's execution
• Each node is a job
• Each job may generate some child jobs

Slide 44: Device-side queuing

• State-of-the-art solution:
  – The job has to communicate the new jobs to the host (possibly transmitting input data)
  – The host queues the child jobs on the device
  – This causes considerable memory traffic!

Slide 45: Device-side queuing

• Device-side queuing:
  – The job running on the device directly queues new jobs in the device/host queues

Slide 46: Device-side queuing

• Benefits of device-side queuing:
  – Enables a more natural expression of the nested parallelism needed by applications with irregular or data-driven loop structures (e.g. breadth-first search)
  – Removes synchronization and communication with the host to launch new threads (removing expensive data transfers)
  – Exposes finer granularities of parallelism to the scheduler and load balancer

Slide 47: Device-side queuing

• OpenCL 2.0 supports device-side queuing:
  – Device-side command queues are out-of-order
  – Parent and child kernels execute asynchronously
  – Synchronization has to be explicitly managed by the programmer (a kernel sketch follows)
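As an illustration, here is a minimal OpenCL 2.0 kernel sketch of device-side enqueue for the tree-traversal example above. The node layout and kernel name are invented for the example, and synchronization with the parent is deliberately left open (CLK_ENQUEUE_FLAGS_NO_WAIT):

```c
/* OpenCL 2.0 device-side enqueue: each job launches its child jobs
 * directly from the GPU, without returning to the host. */
typedef struct {
    int first_child;     /* index of the first child in the nodes array */
    int num_children;
    float value;
} node_t;

kernel void visit(global node_t *nodes, int idx)
{
    nodes[idx].value *= 2.0f;                 /* process this node */

    queue_t q = get_default_queue();
    for (int c = 0; c < nodes[idx].num_children; c++) {
        int child = nodes[idx].first_child + c;
        /* child jobs go straight into the device queue (self-enqueue) */
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1),
                       ^{ visit(nodes, child); });
    }
}
```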

Slide 48: Summary on the queuing model

• User-mode queuing for low-latency dispatch
  – The application dispatches directly
  – No OS or driver required in the dispatch path
• Architected Queuing Layer
  – A single compute dispatch path for all hardware
  – No driver translation, direct to hardware
• Allows for dispatch to queue from any agent
  – CPU or GPU
• GPU self-enqueue enables lots of solutions
  – Recursion
  – Tree traversal
  – Wavefront reforming

Slide 49: Other necessary HW mechanisms

• Task preemption and context switching have to be supported by all computing resources (including GPUs)

Slide 50: Key features of HSA

• hUMA – Heterogeneous Unified Memory Architecture
• hQ – Heterogeneous Queuing
• HSAIL – HSA Intermediate Language

Slide 51: HSA intermediate layer (HSAIL)

• A portable "virtual ISA" for vendor-independent compilation and distribution
  – Like Java bytecodes for GPUs
• Low-level IR, close to the machine-ISA level
  – Most optimizations (including register allocation) are performed before HSAIL
• Generated by a high-level compiler (LLVM, gcc, Java VM, etc.)
  – Application binaries may ship with embedded HSAIL
• Compiled down to the target ISA by a vendor-specific "finalizer"
  – The finalizer may execute at run time, install time, or build time

Slide 52: HSA intermediate layer (HSAIL)

• HSA compilation stack
• HSA runtime stack

Slide 53: HSA intermediate layer (HSAIL)

• Explicitly parallel
  – Designed for data-parallel programming
• Support for exceptions, virtual functions, and other high-level language features
• Syscall methods
  – GPU code can directly call system services, I/O, printf, etc.

Slide 54: HSA intermediate layer (HSAIL)

• Lower level than OpenCL SPIR
  – Fits naturally in the OpenCL compilation stack
• Suitable to support additional high-level languages and programming models:
  – Java, C++, OpenMP, Python, etc.

Slide 55: HSA software stack

• HSA supports many languages, all targeting HSAIL
slide-56
SLIDE 56

HSA and OpenCL

  • HSA is an optimized platform architecture for

OpenCL

– Not an alternative to OpenCL

  • OpenCL on HSA will benefit from

– Avoidance of wasteful copies – Low latency dispatch – Improved memory model – Pointers shared between CPU and GPU – Device side queuing

  • OpenCL 2.0 leverages HSA Features

– Shared Virtual Memory – Platform Atomics

56
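The following is a minimal sketch of platform atomics over fine-grained SVM, where CPU and GPU update the same atomic counter in place. It assumes the device reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER with CL_DEVICE_SVM_ATOMICS; kernel compilation and error handling are omitted:

```c
/* Platform atomics: host and device increment one counter that lives
 * in fine-grained SVM; no copies, no map/unmap. */
#include <CL/cl.h>
#include <stdatomic.h>
#include <stdio.h>

/* Device side (OpenCL C 2.0), shown here for reference:
 *   kernel void bump(global atomic_int *n) {
 *       atomic_fetch_add_explicit(n, 1, memory_order_relaxed,
 *                                 memory_scope_all_svm_devices);
 *   }
 */

void demo(cl_context ctx, cl_command_queue q, cl_kernel bump, size_t work)
{
    atomic_int *n = (atomic_int *)clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
        sizeof(atomic_int), 0);
    atomic_store(n, 0);

    clSetKernelArgSVMPointer(bump, 0, n);
    clEnqueueNDRangeKernel(q, bump, 1, NULL, &work, NULL, 0, NULL, NULL);

    /* CPU increments concurrently with the GPU, on the same memory */
    for (int i = 0; i < 1000; i++)
        atomic_fetch_add_explicit(n, 1, memory_order_relaxed);

    clFinish(q);
    printf("total = %d\n", atomic_load(n));   /* work items + 1000 */
    clSVMFree(ctx, n);
}
```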

Slide 57: HSA and Java

• Targeted at Java 9 (originally planned as the 2015 release)
• Allows developers to efficiently represent data-parallel algorithms in Java
• Sumatra "repurposes" Java 8's multi-core Stream/Lambda APIs to enable both CPU and GPU computing
• At runtime, a Sumatra-enabled Java Virtual Machine (JVM) dispatches selected constructs to the available HSA-enabled devices

Slide 58: HSA and Java

• Evolution of Java acceleration before the Sumatra project

Slide 59: HSA software stack

Slide 60: HSA runtime

• A thin, user-mode API that provides the interface necessary for the host to launch compute kernels on the available HSA components
• The overall goal is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures
• The dispatch mechanism differentiates the HSA runtime from other language runtimes: argument setting and kernel launching are architected at the hardware and specification level

A minimal initialization sketch follows.
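For concreteness, here is a minimal sketch of bringing up the HSA runtime and locating a GPU agent with the core C API (error handling condensed):

```c
/* HSA runtime bring-up: initialize, then iterate agents to find a GPU. */
#include <hsa.h>
#include <stdio.h>

static hsa_status_t find_gpu(hsa_agent_t agent, void *data)
{
    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
    if (type == HSA_DEVICE_TYPE_GPU) {
        *(hsa_agent_t *)data = agent;
        return HSA_STATUS_INFO_BREAK;        /* stop iterating */
    }
    return HSA_STATUS_SUCCESS;
}

int main(void)
{
    hsa_init();                              /* user-mode runtime init */

    hsa_agent_t gpu;
    hsa_iterate_agents(find_gpu, &gpu);      /* walk all HSA agents */

    char name[64];
    hsa_agent_get_info(gpu, HSA_AGENT_INFO_NAME, name);
    printf("found HSA GPU agent: %s\n", name);

    /* ... create queues and dispatch AQL packets (see slide 35) ... */
    hsa_shut_down();
    return 0;
}
```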

Slide 61: HSA runtime

• The HSA core runtime API is standard across all HSA vendors, so languages that use the HSA runtime can run on any vendor's platform that supports the API
• The implementation of the HSA runtime may include kernel-level components (required for some hardware, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations)

Slide 62: HSA runtime

Slide 63: HSA taking platform to programmers

• Balance between CPU and GPU for performance and power efficiency
• Make GPUs accessible to a wider audience of programmers
  – Programming models close to today's CPU programming models
  – Enabling more advanced language features on the GPU
  – Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on the GPU
  – A kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)
• Enabling task-graph style algorithms, ray tracing, etc.

Slide 64: HSA taking platform to programmers

• A complete tool-chain for programming, debugging and profiling
• HSA provides a compatible architecture across a wide range of programming models and HW implementations

Slide 65: HSA programming model

• Single source
  – Host and device code side-by-side in the same source file
  – Written in the same programming language
• Single unified coherent address space
  – Freely share pointers between host and device
  – Similar memory model to a multi-core CPU
• Parallel regions identified with existing language syntax
  – Typically the same syntax used for a multi-core CPU
• HSAIL is the compiler IR that supports these programming models

Slide 66: Specifications and software

Slide 67: HSA architecture V1

• GPU compute C++ support
• User-mode scheduling
• Fully coherent memory between CPU and GPU
• GPU uses pageable system memory via CPU pointers
• GPU graphics pre-emption
• GPU compute context switch

Slides 68-73: Partner roadmaps

(Partner roadmap charts; one frame is dated 2015.)