


Medical Image Analysis on GPUs: Challenges and Future Trends

Manuel Ujaldón

Nvidia CUDA Fellow

Associate Professor Computer Architecture Department University of Malaga (Spain)

Talk outline [25 slides]

  • 1. Introduction [2 slides]
  • 2. Programming GPUs: OpenACC/CUDA [8 slides]
  • 3. Hardware designs: Kepler/Echelon [14 slides]
  • 4. Conclusions [1 slide]

2

  • I. Introduction

3

The GPU and the CPU side by side

Application domains: oil & gas, finance, medical, biophysics, numerics, audio, video, imaging.

  • GPU (parallel computing, 512 cores): graphics; highly parallel, data-intensive applications.
  • CPU (sequential computing, 4 cores): control and communication; productivity-based applications.

4

Use both the CPU and the GPU: each processor executes those parts of the code where it is more effective.


End users for GPUs

  • Tesla graphics card (less than $5,000): millions of researchers.
  • Cluster of Tesla servers (between $50,000 and $1,000,000): thousands of researchers.
  • Large-scale clusters (more than $1,000,000): hundreds of researchers.

5

  • II. Programming GPUs: OpenACC / CUDA

6

The OpenACC initiative

7

OpenACC is an alternative to CUDA: CUDA targets computer scientists, while OpenACC targets average programmers

The idea: Introduce a parallel programming standard for accelerators based on directives (like OpenMP), which:

  • Are inserted into C, C++ or Fortran programs to direct the compiler to parallelize certain code sections.
  • Provide a common code base: multi-platform and multi-vendor.
  • Enhance portability across other accelerators and multicore CPUs.
  • Bring an ideal way to preserve the investment in legacy applications by enabling an easy migration path to accelerated computing.
  • Relax the programming effort (and the expected performance).

First supercomputing customers:

  • United States: Oak Ridge National Laboratory.
  • Europe: Swiss National Supercomputing Centre.

8


OpenACC: The way it works

9

OpenACC: Use of directives

10

OpenACC: Results

11

CUDA: How we reached the current status

Timeline: before 2005, 2005-2007, 2008-2012.

12


CUDA: How programming elements are related to the underlying hardware

13

CUDA: Hardware targets from a source code

14

  • III. Hardware designs: Kepler and Echelon

15

Mapping CUDA elements to the GPU

16

GPU:
  • Multiprocessors 1 ... N, each containing:
    - a SIMD control unit,
    - processors 1 ... M, each with its own registers,
    - shared memory,
    - a constant cache and a texture cache.
  • Global memory: external to the GPU chip, but inside the graphics card.

CUDA elements map onto this hierarchy: each thread runs on a processor, each thread block on a multiprocessor, and each grid (grid 0, grid 1, ...) on the whole GPU. Registers, shared memory and the caches are memory integrated on the GPU; global memory is external.


The evolution of the GPU hardware

17

Architecture                    G80      GT200    Fermi GF100  Fermi GF104  Kepler GK104  Kepler GK110
Time frame                      2006-07  2008-09  2010         2011         2012          2013
CUDA Compute Capability (CCC)   1.0      1.2      2.0          2.1          3.0           3.5
N (multiprocessors)             16       30       16           7            8             15
M (cores/multiproc.)            8        8        32           48           192           192
Number of cores (N x M)         128      240      512          336          1536          2880

Commercial models

18

Kepler GK110: Enhancements versus Fermi

19

GPU multiprocessors: From Fermi SMs to Kepler SMXs

20


The memory hierarchy: Fermi vs. Kepler

21

The memory hierarchy

22

GPU generation                     Fermi                  Kepler
Hardware model                     GF100     GF104        GK104        GK110        Limitation  Impact
CUDA Compute Capability (CCC)      2.0       2.1          3.0          3.5

Max. 32-bit registers / thread     63        63           63           255          SW          Working set
32-bit registers / multiprocessor  32 K      32 K         64 K         64 K         HW          Working set
Shared memory / multiprocessor     16-48 KB  16-48 KB     16-32-48 KB  16-32-48 KB  HW          Tile size
L1 cache / multiprocessor          48-16 KB  48-16 KB     48-32-16 KB  48-32-16 KB  HW          Access speed
L2 cache / GPU                     768 KB    768 KB       1536 KB      1536 KB      HW          Access speed

GPUDirect

Direct transfers between GPUs and network interfaces:

23

Kepler vs. Fermi: Large scale computations

24

GPU generation                  Fermi            Kepler
Hardware model                  GF100   GF104    GK104    GK110    Limitation  Impact
Compute Capability (CCC)        2.0     2.1      3.0      3.5

Max. X dimension of CUDA grids  2^16-1  2^16-1   2^32-1   2^32-1   Software    Problem size
Dynamic parallelism             No      No       No       Yes      Hardware    Problem structure
Hyper-Q                         No      No       No       Yes      Hardware    Thread scheduling


25

The way we did biomedical image analysis in the past

[Figure: a high-resolution biomedical image is split into image tiles (40x zoom); classification tasks are assigned to the computational units (CPU, PS3, GPU); the output is a classification map with four classes: label 1, label 2, background, and undetermined.]

26

Dynamic parallelism

[Diagram: CPU-GPU interaction, contrasting the GPU as a co-processor with the GPU as an autonomous processor.]

Parallelism can be deployed depending on the data volume of each computational region.

27

[Diagram: CUDA until 2012 vs. CUDA on Kepler.]

  • Computing power can be assigned to regions in proportion to their requirements.
  • Parallelism can be deployed depending on run-time evolution.

This makes GPU computation easier and broadens the range of GPU applications.

28


CPU processes mapped onto the GPU

29

  • Fermi without Hyper-Q: temporal division on each SM; processes A-F execute one after another, keeping GPU utilization low.
  • Kepler with Hyper-Q: simultaneous execution of multiple processes on the SMXs, raising GPU utilization and saving time.

[Figure: GPU utilization (%) over time for processes A-F in both cases, showing the time saved with Hyper-Q.]

Swift operations: thread array creation, messages, block transfers, collective operations.

A look ahead: The Echelon execution model

30

[Diagram: threads and objects (A, B) in a global address space spanning the memory hierarchy, communicating through load/store, active messages, and bulk transfers.]

Conclusions

Kepler represents a new generation of GPU hardware, deploying thousands of cores to benefit from the CUDA model in large-scale applications. Major enhancements:

  • The GPU becomes more autonomous: it can create threads by itself.
  • Thread scheduling becomes more efficient, particularly for tiny threads.
  • Larger caches and higher memory bandwidth, also between GPUs.
  • The GPU can now execute much larger problem sizes and deploy more parallelism.

Biomedicine and image processing are two fields that can benefit the most from these enhancements.

31

Thanks for your attention!

You can always reach me at:

E-mail: ujaldon@uma.es
Web page at the University of Malaga: http://manuel.ujaldon.es
Web site at Nvidia:

http://research.nvidia.com/users/manuel-ujaldon

For additional information, read the Kepler whitepaper:

http://www.nvidia.com/object/nvidia-kepler.html

To watch the official talk given at GTC'12 (webinar):

http://www.gputechconf.com/gtcnew/on-demand-gtc.php#1481

To listen to additional Nvidia material about GPU computing:

http://www.nvidia.com/object/webinar.html

32