Communication Analysis of the Cell Broadband Engine Processor
Fabrizio Petrini, Pacific Northwest National Laboratory


SLIDE 1

EDGE Workshop, UNC, May 2006

Communication Analysis of the Cell Broadband Engine Processor

Fabrizio Petrini, Pacific Northwest National Laboratory, fabrizio.petrini@pnl.gov
Michael Perrone, IBM TJ Watson, mpp@us.ibm.com
Michael Kistler and Gordon Fossum, IBM Austin Research Laboratory, mkistler@us.ibm.com, fossum@us.ibm.com

SLIDE 2

The Charm of the IBM Cell Broadband Engine

Extraordinary processing power
8 independent processing units (SPEs)
One control processor: a traditional 64-bit PowerPC
At 3.2 GHz the Cell has a peak performance of 204.8 Gflops/second (single precision) and 14.64 Gflops/second (double precision)
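The single-precision figure follows directly from the clock rate and the SPEs' SIMD pipelines. A quick sanity check; the 4-wide SIMD with fused multiply-add (i.e. 8 flops per SPE per cycle) is the commonly cited issue rate, stated here as an assumption rather than taken from the slide:

```python
# Peak single-precision rate of the Cell at 3.2 GHz:
# 8 SPEs x 4-wide SIMD x 2 flops per lane per cycle (fused multiply-add).
spes = 8
simd_lanes = 4        # 128-bit registers hold 4 x 32-bit floats
flops_per_lane = 2    # multiply + add fused in one cycle
clock_ghz = 3.2

peak_sp_gflops = spes * simd_lanes * flops_per_lane * clock_ghz
print(peak_sp_gflops)   # 204.8
```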

SLIDE 3

Communication Performance

Internal bus (Element Interconnect Bus, EIB) with a peak performance of 204.8 Gbytes/second
Memory bandwidth: 25.6 Gbytes/second
Impressive I/O bandwidth: 25 Gbytes/second inbound, 35 Gbytes/second outbound
Many outstanding memory requests: up to 128, typical of multi-threaded processors
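Little's law makes the link between outstanding requests and sustained bandwidth concrete: bytes in flight = bandwidth x latency. A rough sketch; the ~90 ns round-trip latency and 128-byte request size are illustrative assumptions, not figures from this slide:

```python
# Little's law: to sustain bandwidth B with latency L, you need
# B x L bytes in flight at all times.
target_bw_gb_s = 25.6    # memory bandwidth (from the slide)
latency_ns = 90.0        # assumed round-trip memory latency (illustrative)
request_bytes = 128      # assumed transfer size per request

in_flight_bytes = target_bw_gb_s * latency_ns    # GB/s x ns = bytes
requests_needed = in_flight_bytes / request_bytes
print(round(requests_needed))   # 18 -- comfortably below the 128 supported
```

With up to 128 requests in flight, the hardware has ample concurrency headroom to cover even much larger latencies.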

SLIDE 4

Moving the Spotlight from Processor Performance to Communication Performance

Traditionally the focus is on (raw) processor performance
Emphasis is now shifting towards communication performance
Lots of (peak) processing power inside a chip (approaching Teraflops/sec)
Only a small fraction is delivered to applications
Lots of (peak) aggregate communication bandwidth inside the chip (approaching Terabytes/sec)
But processing units do not interact frequently
Small on-chip local memories, so little data reuse
Main memory bandwidth is the primary bottleneck, and then I/O and network bandwidth

SLIDE 5

Dangerous Connection Between Memory and Network Performance and Programmability

The programming model is already a critical issue, and it is going to get worse
Low data reuse increases the algorithmic complexity
Memory and network bandwidth are key to achieving performance and simplifying the programming model
Multi-core, uni-bus ☺

SLIDE 6

Internal Structure of the Cell BE

[Block diagram: PPE and SPE0-SPE7 attached to the EIB data arbiter, together with the MIC (memory interface) and the BIF, IOIF0 and IOIF1 I/O interfaces]

SLIDE 7

Cell BE Communication Architecture

SPUs can only access programs and data in their local storage
SPEs have a DMA controller that performs transfers between local stores, main memory and I/O
SPUs can post a list of DMAs
SPUs can also use mailboxes and signals to perform basic synchronizations
More complex synchronization mechanisms can support atomic operations
All resources can be memory mapped

SLIDE 8

SPE Internal Architecture

[Diagram: the SPU (3.2 GHz) and its local store (LS) connect through the channel interface to the memory flow controller (MFC, 1.6 GHz), which contains the DMA queues, DMAC, MMU and TLB; the bus interface unit (BIU) links the SPE to the EIB at 16 bytes/cycle in and out, and DMA steps (1)-(7) proceed through the EIB and MIC to off-chip memory; resources are also accessible via MMIO]

SLIDE 9

Basic Latencies (3.2 GHz)

LATENCY COMPONENT                CYCLES   NANOSECONDS
DMA issue                            10         3.125
DMA to EIB                           30         9.375
List element fetch                   10         3.125
Coherence protocol                  100        31.25
Data transfer for inter-SPE put     140        43.75
TOTAL                               290        90.61
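The breakdown is internally consistent: the nanosecond column is the cycle count divided by the 3.2 GHz clock, and the components sum to the total (the quoted 90.61 ns matches 290 cycles up to rounding). A quick check:

```python
# Verify the latency table: components sum to the total,
# and nanoseconds = cycles / clock frequency (GHz).
clock_ghz = 3.2
components_cycles = {
    "DMA issue": 10,
    "DMA to EIB": 30,
    "List element fetch": 10,
    "Coherence protocol": 100,
    "Data transfer for inter-SPE put": 140,
}
total_cycles = sum(components_cycles.values())
total_ns = total_cycles / clock_ghz
print(total_cycles, round(total_ns, 3))   # 290 cycles, ~90.625 ns
```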

SLIDE 10

Is this a processor or a supercomputer on a chip?

Striking similarities with high-performance networks for supercomputers (e.g., Quadrics Elan4)
DMAs overlap computation and communication
Similar programming model
Similar synchronization algorithms: barriers, allreduces, scatter & gather
We can adopt the same techniques that we already use in high-performance clusters and supercomputers!
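The "DMAs overlap computation and communication" point is exactly what makes double buffering pay off on both the Cell and cluster interconnects: while the SPU computes on one buffer, the MFC streams the next one in. A toy timing model; the block count and per-block times are hypothetical, chosen only to illustrate the effect:

```python
# Double buffering: overlap the DMA of block i+1 with computation on block i.

def time_serial(n_blocks, t_dma, t_compute):
    """No overlap: each block is fetched, then processed."""
    return n_blocks * (t_dma + t_compute)

def time_double_buffered(n_blocks, t_dma, t_compute):
    """Overlap: after the first fetch, DMA and compute run in parallel,
    so each middle step costs max(t_dma, t_compute); then drain the
    last computation."""
    return t_dma + (n_blocks - 1) * max(t_dma, t_compute) + t_compute

# Hypothetical workload: 100 blocks, 1 us to fetch and 1 us to process each.
print(time_serial(100, 1.0, 1.0))           # 200.0
print(time_double_buffered(100, 1.0, 1.0))  # 101.0 -- nearly 2x faster
```

When DMA and compute times are balanced, the transfer cost is almost entirely hidden; when one dominates, the pipeline runs at the speed of the slower of the two.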

SLIDE 11

DMA Latency

[Plot: latency in nanoseconds (200-1200) vs. message size from 4 bytes to 16 Kbytes, for blocking gets and puts to main memory and to another SPE's local store]

SLIDE 12

DMA Bandwidth

[Plot: bandwidth in GB/second (up to 25) vs. message size from 4 bytes to 16 Kbytes, for blocking gets and puts to main memory and to another SPE's local store]

SLIDE 13

DMA batches (put)

[Plot: put bandwidth in GB/second (up to 30) vs. message size from 4 bytes to 16 Kbytes, for blocking transfers, batches of 2, 4, 8, 16 and 32 DMAs, and non-blocking transfers]

SLIDE 14

Hot Spot

[Plot: aggregate bandwidth in GB/second (17-26) vs. number of SPEs (1-8) targeting the same destination, for puts and gets to main memory and to an SPE's local store]

SLIDE 15

Latency Distribution under Hot-Spot

[Histogram: number of items vs. latency in µs (2-14), comparing blocking and non-blocking puts under hot-spot traffic]

SLIDE 16

Aggregate Behavior

[Plot: aggregate bandwidth in GB/second (20-200) vs. number of SPEs (2-8), for uniform traffic, complement traffic, and pairwise put and get traffic]

SLIDE 17

Putting the Pieces Back Together

We have discussed the "raw" communication capability of the network
We now try to see how we can parallelize scientific applications on the Cell BE: a point in a large design space
Sweep3D: a well-known scientific application
A case study to provide insight on the various aspects of the Cell BE: parallelization strategies, nature of parallelism, actual computation and communication performance

SLIDE 18

Challenges

Initial excitement in the scientific community, but concerns about:
The actual fraction of performance that can be achieved with real applications
The complexity of developing new applications
The complexity of developing new parallelizing compilers
Whether there is a clear migration path for existing legacy software, written using MPI or shared-memory programming libraries (Global Arrays, UPC, Cray Shmem, etc.)

SLIDE 19

Sweep3D

Application kernel representative of the ASC workload
Consumes a considerable number of cycles on ASC machines
Relevant for a number of national security applications at PNNL
It solves a 1-group, time-independent, discrete-ordinates, three-dimensional neutron transport problem

SLIDE 20

Sweep3D: data mapping and communication pattern

SLIDE 21

Parallelization Strategy

Process-level parallelism: we keep the existing MPI parallelization to guarantee a seamless migration path for existing software
Thread-level parallelism: take advantage of loop independence
Data-streaming parallelism: data orchestration algorithms
Vector parallelism: to exploit the vector units
Pipeline parallelism: even-odd pipe optimizations

SLIDE 22

SLIDE 23

An arsenal of tools/techniques and optimizations
SLIDE 24

Work in progress

SLIDE 25

How does it compare with other processors?

SLIDE 26

Multicore surprises

High sustained floating-point performance
64% of peak in double precision (9 Gflops), 25% in single precision (50 Gflops)
Typical values of actual performance for Sweep3D are 5-10%
Memory bound: the real problem is data movement, not floating-point performance
Outstanding power efficiency: 2-4 times faster than BlueGene/L, the most power-efficient computer at the moment (conservative estimate)
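The sustained-performance percentages can be cross-checked against the peak rates quoted at the start of the talk:

```python
# Sustained rates implied by the quoted fractions of peak
# (peak figures from slide 2).
peak_dp_gflops = 14.64   # double precision
peak_sp_gflops = 204.8   # single precision

sustained_dp = 0.64 * peak_dp_gflops   # ~9.4, quoted as "9 Gflops"
sustained_sp = 0.25 * peak_sp_gflops   # 51.2, quoted as "50 Gflops"
print(round(sustained_dp, 2), round(sustained_sp, 1))
```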

SLIDE 27

Conclusions

Papers available at the following URLs:

"Cell Multiprocessor Interconnection Network: Built for Speed", IEEE Micro, May/June 2006
http://hpc.pnl.gov/people/fabrizio/ieeemicro-cell.pdf

"Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine", submitted for publication
http://hpc.pnl.gov/people/fabrizio/sweep3d-cell.pdf