SLIDE 1

AMD’s Unified CPU & GPU Processor Concept

Advanced Seminar Computer Engineering Sven Nobis

Institute of Computer Engineering (ZITI) University of Heidelberg

February 5, 2014

SLIDE 2

AMD’s Unified CPU & GPU Processor Concept Sven Nobis Introduction Background

CPU vs. GPU OpenCL & CUDA

Related Work The way to HSA

Heterogeneous Unified Memory Access

HSA

Concepts System Components Development Tools

Conclusion / Outlook References

Overview

1 Introduction
2 Background
    CPU vs. GPU
    Current Platforms: OpenCL & CUDA
3 Related Work
4 The way to HSA
    Heterogeneous Unified Memory Access
5 Heterogeneous System Architecture
    Concepts
    System Components
    Development Tools
6 Conclusion / Outlook

SLIDE 3

Previous: Single-Core Era

[Figure: single-thread performance over time, with a "we are here" marker]

Single-Core Era
    Enabled by: Moore's Law, voltage scaling
    Constrained by: power, complexity
    Programming: Assembly → C/C++ → Java …

[8, P. 5]

SLIDE 4

Today: Multi-Core Era

[Figure: throughput performance over time (# of processors), with a "we are here" marker, shown next to the Single-Core Era panel]

Multi-Core Era
    Enabled by: Moore's Law, SMP architecture
    Constrained by: power, parallel software, scalability
    Programming: pthreads → OpenMP / TBB …

[8, P. 5]

SLIDE 5

Today and into the future: Heterogeneous Systems Era

[Figure: modern application performance over time (data-parallel exploitation), with a "we are here" marker, shown next to the Single-Core Era and Multi-Core Era panels]

Heterogeneous Systems Era
    Enabled by: abundant data parallelism, power-efficient GPUs
    Temporarily constrained by: programming models, communication overhead
    Programming: Shader → CUDA, OpenCL → C++ and Java

[8, P. 5]

SLIDE 6

Introduction

Today's problems in CPU/GPU programming
    Programmability barrier
    Communication costs

Solution
    AMD's Unified CPU & GPU Processor Concept?
    → Heterogeneous System Architecture (HSA)

[3, P. 4]

SLIDE 9

CPU vs. GPU

CPU: Latency Compute Unit (LCU)

GPU: Throughput Compute Unit (TCU)

SLIDE 10

OpenCL & CUDA

Both are well-established platforms for GPU programming

Compute Unified Device Architecture (CUDA)
    Proprietary
    Only for NVIDIA GPUs

Open Computing Language (OpenCL)
    Open standard
    ATI, NVIDIA, Intel, ...
    Not only GPUs

SLIDE 11

OpenCL

Platform Model

[10]

SLIDE 12

OpenCL

Execution Model

[5, P. 11]
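The execution-model figure is not reproduced in this transcript. As a rough illustration of how OpenCL addresses work-items, the sketch below splits a one-dimensional global ID into a work-group ID and a local ID, and recombines them. It is plain C++ rather than OpenCL C; `WorkItemId`, `decompose`, and `compose` are hypothetical names introduced here, and a zero global offset is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type for illustration; not part of the OpenCL API.
struct WorkItemId {
    std::size_t group;  // index of the work-group within the index space
    std::size_t local;  // index of the work-item within its work-group
};

// Split a one-dimensional global ID into (work-group ID, local ID).
// Assumes a zero global offset.
inline WorkItemId decompose(std::size_t global_id, std::size_t local_size) {
    assert(local_size > 0);
    return WorkItemId{global_id / local_size, global_id % local_size};
}

// Inverse mapping: global_id = group * local_size + local.
inline std::size_t compose(WorkItemId id, std::size_t local_size) {
    assert(id.local < local_size);
    return id.group * local_size + id.local;
}
```

With a work-group size of 64, for example, global ID 70 falls into work-group 1 at local position 6.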

SLIDE 14

Related Work

In CUDA [4]
    Unified Virtual Addressing (UVA) in CUDA 4
    Unified Memory in CUDA 6
        → Unifies the developer's view of memory
        Implicit copy & pinning

In OpenCL
    Shared Virtual Memory

Copy is still necessary (for fast access)

SLIDE 16

CPU and GPU cores on a single die

[Figure: Llano APU die, with the CPU and GPU regions labeled]

[3, P. 2] [7, P. 7]

SLIDE 18

hUMA: Heterogeneous Unified Memory Access

Today: Non-Uniform Memory Access
    Different/partitioned physical memory per compute unit
    Multiple virtual memory address spaces

hUMA: Heterogeneous Unified Memory Access
    Same physical memory
    Same virtual memory for all compute units

[Figure, today: multiple virtual memory address spaces. CPU0 uses VIRTUAL MEMORY1 and the GPU uses VIRTUAL MEMORY2; VA1->PA1 and VA2->PA1 both point into the same physical memory.]

[Figure, hUMA: common virtual memory for all HSA agents. CPU0 and GPU share one virtual memory; VA->PA is the same on both sides.]

[2, P. 7], [2, P. 8]
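The difference between the two diagrams can be sketched with a toy page-table model. This is an illustration only, not how a real MMU or IOMMU is implemented; `PageTable`, `Agent`, and `same_physical_page` are hypothetical names, and addresses are treated as opaque page numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using VirtAddr = std::uint64_t;
using PhysAddr = std::uint64_t;

// Toy page table: a page-granular virtual-to-physical map.
class PageTable {
public:
    void map_page(VirtAddr va, PhysAddr pa) { table_[va] = pa; }

    // Returns false if the virtual address is not mapped.
    bool translate(VirtAddr va, PhysAddr& pa) const {
        std::map<VirtAddr, PhysAddr>::const_iterator it = table_.find(va);
        if (it == table_.end()) return false;
        pa = it->second;
        return true;
    }

private:
    std::map<VirtAddr, PhysAddr> table_;
};

// A compute unit (CPU or GPU) that translates through the page table it is bound to.
struct Agent {
    const PageTable* page_table;

    bool access(VirtAddr va, PhysAddr& pa) const {
        assert(page_table != nullptr);
        return page_table->translate(va, pa);
    }
};

// True if both agents reach the same physical page through the given virtual addresses.
inline bool same_physical_page(const Agent& a, VirtAddr va_a,
                               const Agent& b, VirtAddr va_b) {
    PhysAddr pa_a = 0, pa_b = 0;
    return a.access(va_a, pa_a) && b.access(va_b, pa_b) && pa_a == pa_b;
}
```

With two separate `PageTable` objects, the CPU and the GPU need different virtual addresses (VA1, VA2) to reach the same physical page, so a CPU pointer means nothing to the GPU. When both `Agent`s are bound to one shared `PageTable`, the same virtual address works on both sides.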

SLIDE 19

hUMA: Heterogeneous Unified Memory Access (2)

Required: hUMA Memory Controller Features

Shared page table support
    Same large address space as the CPU
    Page faulting

Coherent memory regions
    Fully coherent shared memory model
    Like on today's SMP CPU systems
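One practical consequence of a shared address space is that pointer-based data structures can be handed to the GPU as they are. The sketch below contrasts the two situations in plain C++; the "device" functions are ordinary host functions standing in for GPU code, and all names (`Node`, `flatten`, `sum_flat`, `sum_linked`) are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Node {
    int value;
    Node* next;  // raw pointer: only meaningful inside one virtual address space
};

// Without a shared address space: the host first copies the payload into a
// flat buffer, because the device could not follow host pointers.
inline std::vector<int> flatten(const Node* head) {
    std::vector<int> out;
    for (const Node* n = head; n != nullptr; n = n->next) out.push_back(n->value);
    return out;
}

// Stand-in for device code that works on the copied buffer.
inline long sum_flat(const std::vector<int>& buffer) {
    long sum = 0;
    for (int v : buffer) sum += v;
    return sum;
}

// With a shared address space: the stand-in for device code follows the very
// same pointers the host built, and no copy is made.
inline long sum_linked(const Node* head) {
    long sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) sum += n->value;
    return sum;
}
```

Both paths compute the same result; the difference is the extra traversal and copy in `flatten`, which is the kind of communication cost that shared page tables and coherent memory are meant to avoid.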

SLIDE 21

Concepts

Unified Address Space
    Already mentioned with hUMA
Unified Programming Model
Queuing
HSA Intermediate Language

SLIDE 22

Concepts

Unified Programming Model

Current programming models
    → Treat the GPU as a remote processor

Extending existing concepts to use HSA
    Programming languages like C++
    Task-parallel and data-parallel APIs like C++ AMP
    Stay in the developer's environment

    #include <iostream>
    #include <amp.h>
    using namespace concurrency;

    int main() // "Hello World" in C++ AMP
    {
        // Each value is one less than the character code it stands for.
        int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
        array_view<int> av(11, v);

        // Add 1 to every element; the lambda runs on the accelerator.
        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
        {
            av[idx] += 1;
        });

        // Print the decoded characters on the host.
        for (unsigned int i = 0; i < av.extent.size(); i++)
            std::cout << static_cast<char>(av(i));
    }

[6]
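For readers without a C++ AMP toolchain, the following sketch shows what the listing computes, using only the standard library. `for_each_index` is a hypothetical, sequential stand-in for `parallel_for_each`; it is not part of C++ AMP and runs entirely on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sequential stand-in for parallel_for_each:
// calls body(idx) once for every index in [0, extent).
template <typename Body>
void for_each_index(std::size_t extent, Body body) {
    for (std::size_t idx = 0; idx < extent; ++idx) body(idx);
}

// Same data and same per-element operation as the C++ AMP listing.
inline std::string decode_hello() {
    std::vector<int> v = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    for_each_index(v.size(), [&v](std::size_t idx) { v[idx] += 1; });

    std::string out;
    for (int c : v) out.push_back(static_cast<char>(c));
    assert(out.size() == v.size());
    return out;
}
```

Because every index is processed independently, the loop body can be executed in any order or in parallel, which is what makes the operation a fit for a throughput-oriented device. `decode_hello()` returns the string "Hello world".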

SLIDE 23

Concepts

Queuing - Current

[5, P.9]

SLIDE 24

Concepts

Queuing - New!

[5, P.9]
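The queuing figures are not reproduced in this transcript. As a loose illustration of the user-mode queuing idea, where the application places work packets directly into a queue in shared memory instead of going through a driver call for each dispatch, here is a minimal single-producer/single-consumer ring buffer in standard C++. The `Packet` layout and the `UserModeQueue` class are assumptions made for this sketch; they do not reproduce the HSA packet format or runtime API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical dispatch packet; the real HSA packet format is not modeled.
struct Packet {
    std::uint32_t kernel_id;
    std::uint32_t grid_size;
};

// Single-producer/single-consumer ring buffer in memory visible to both sides.
template <std::size_t N>
class UserModeQueue {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");

public:
    // Producer side (application): write the packet, then publish the new
    // write index. The release store plays the role of ringing a doorbell.
    bool enqueue(const Packet& p) {
        const std::uint64_t w = write_.load(std::memory_order_relaxed);
        const std::uint64_t r = read_.load(std::memory_order_acquire);
        if (w - r == N) return false;  // queue full
        slots_[w & (N - 1)] = p;
        write_.store(w + 1, std::memory_order_release);
        return true;
    }

    // Consumer side (stand-in for the device's packet processor).
    bool dequeue(Packet& out) {
        const std::uint64_t r = read_.load(std::memory_order_relaxed);
        const std::uint64_t w = write_.load(std::memory_order_acquire);
        assert(w - r <= N);
        if (r == w) return false;  // queue empty
        out = slots_[r & (N - 1)];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }

private:
    Packet slots_[N];
    std::atomic<std::uint64_t> write_{0};
    std::atomic<std::uint64_t> read_{0};
};
```

The point of the design is that `enqueue` involves only ordinary memory writes by the application; in this sketch no system call or driver transition appears on the dispatch path.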

SLIDE 25

Concepts

HSA Intermediate Language

HSAIL: HSA Intermediate Language
    Bytecode
    Designed for data-parallel programming
    GPU independent
    Generated by the compilation stack (covered later)
    Bytecode is compiled at runtime to the hardware instruction set of the current device
    Execution model is similar to OpenCL

SLIDE 26

System Components

APU

Software stack
    Compilation Stack
    Runtime Stack
    System (Kernel) Software

SLIDE 27

System Components

Compilation Stack

[5, P. 15]

SLIDE 28

System Components

Runtime-Stack

[5, P. 16]

SLIDE 29

Development Tools

OpenCL
C++ AMP: C++ Accelerated Massive Parallelism
BOLT Library
Aparapi

SLIDE 30

Development Tools

OpenCL

"HSA is an optimized platform architecture for OpenCL; not an alternative to OpenCL" [8, P. 13]

OpenCL on HSA will benefit from HSA's features

SLIDE 31

Development Tools

BOLT Library

Simple example:

    #include <bolt/sort.h>
    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    int main()
    {
        // generate random data (on host)
        std::vector<int> a(1000000);
        std::generate(a.begin(), a.end(), rand);

        // sort, run on best device
        bolt::sort(a.begin(), a.end());
    }

[9, P.5]

SLIDE 32

Development Tools

BOLT and C++ AMP

Bolt for C++ AMP, user-specified functor:

    #include <bolt/transform.h>
    #include <vector>

    struct SaxpyFunctor
    {
        float _a;
        SaxpyFunctor(float a) : _a(a) {}

        // Callable on both the CPU and a C++ AMP accelerator.
        float operator() (const float &xx, const float &yy) restrict(cpu,amp)
        {
            return _a * xx + yy;
        }
    };

    int main()
    {
        SaxpyFunctor s(100);
        std::vector<float> x(1000000); // initialization not shown
        std::vector<float> y(1000000); // initialization not shown
        std::vector<float> z(1000000);

        // z[i] = 100 * x[i] + y[i]
        bolt::transform(x.begin(), x.end(), y.begin(), z.begin(), s);
    }

[9, P.6]
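The Bolt call mirrors the interface of `std::transform`, so a CPU-only reference version is easy to write and can serve as a check for the accelerated result. The sketch below uses only the standard library; the functor drops the `restrict(cpu,amp)` qualifier (a C++ AMP extension), and the `saxpy` wrapper is a hypothetical helper introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Same computation as the functor in the Bolt listing, without restrict(cpu,amp).
struct SaxpyFunctor {
    float a;
    explicit SaxpyFunctor(float a_) : a(a_) {}
    float operator()(const float& x, const float& y) const { return a * x + y; }
};

// CPU-only reference: returns z with z[i] = a * x[i] + y[i].
inline std::vector<float> saxpy(float a,
                                const std::vector<float>& x,
                                const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(), SaxpyFunctor(a));
    return z;
}
```

For `a = 100`, `x = {1, 2, 3}`, and `y = {10, 20, 30}`, the result is `{110, 220, 330}`.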

SLIDE 34

Conclusion

Interesting concept
    Simplifies development
    Opens up new possibilities

Open platform

In heavy development
    Missing hardware with hUMA
        → Outlook
    Software components not ready

→ A lot of potential

SLIDE 35

Outlook

Middle of January 2014:
    Kaveri APU is available [1]
    Desktop APU
    Support for
        hUMA
        Queuing
    Can connect both DDR3 and GDDR5 [11]

Server APU follows:
    Berlin
    ARM-based: Seattle

[11]

SLIDE 36

References I

[1] Benz, Benjamin: AMD fordert mit Kaveri Intels Core i5 heraus. Heise Online. http://heise.de/-2085447. Version: January 2014

[2] Bratt, Ian: HSA Queueing. HOT CHIPS 2013. http://www.slideshare.net/hsafoundation/hsa-queuing-hot-chips-2013. Version: August 2013

[3] Fröning, Holger: Lecture 02 – CUDA Programming. Lecture: GPU Computing, 2013

[4] Harris, Mark: Unified Memory in CUDA 6. http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/. Version: November 2013

SLIDE 37

References II

[5] Kyriazis, George: A Heterogeneous System Architecture: Technical Review / HSA Foundation. AMD, August 2012. – Technical report. – Rev. 1.0

[6] Moth, Daniel: "Hello world" in C++ AMP. http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/04/quot-hello-world-quot-in-c-amp.aspx. Version: March 2012

[7] Rogers, Phil: THE PROGRAMMER'S GUIDE TO THE APU GALAXY. AMD Fusion Developer Summit. http://www.slideshare.net/hsafoundation/afds-keynote-the-programmers-guide-to-the-apu-galaxy. Version: June 2011

SLIDE 38

References III

[8] Rogers, Phil: Heterogeneous System Architecture Overview. HOT CHIPS 2013. http://de.slideshare.net/hsafoundation/hsa-intro-hot-chips2013-final. Version: August 2013

[9] Sander, Ben: BOLT: A C++ Template Library for HSA. AMD Fusion Developer Summit. http://www.slideshare.net/hsafoundation/bolt-for-hsa-by-ben-sanders. Version: June 2012

[10] Staff, AMD: OpenCL™ and the AMD APP SDK v2.4. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-and-the-amd-app-sdk-v2-4/. Version: April 2011

SLIDE 39

References IV

[11] Windeck, Christof: AMD Kaveri: Feinheiten aus den Datenblättern. Heise Online. http://heise.de/-2088349. Version: January 2014