SLIDE 1

When Multicore Isn’t Enough: Trends and the Future for Multi-Multicore Systems

Matt Reilly, Chief Engineer, SiCortex, Inc.

SLIDE 2

The Computational Model

For a large set of interesting problems (N is the number of independent processes):

T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm

or, if the phases overlap perfectly,

T_sol = MAX(T_arith/N, T_mem/N, T_IO, f(N)·T_comm)

For many interesting tasks, single-chip performance is determined entirely by T_mem and memory bandwidth.
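A toy evaluation of both forms of the model, with invented component times and f(N) taken as a constant (a simplification; real communication terms grow with N):

```c
/* Sketch: how the serialized and overlapped cost models diverge as N
 * grows. All component times are invented for illustration. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double t_arith = 100.0, t_mem = 400.0;  /* these shrink with N */
    double t_io = 2.0, t_comm = 1.0;        /* these do not */
    for (int n = 1; n <= 4096; n *= 8) {
        double a = t_arith / n, m = t_mem / n;
        double serialized = a + m + t_io + t_comm;
        double overlapped = fmax(fmax(a, m), fmax(t_io, t_comm));
        printf("N=%5d  sum=%8.3f  max=%8.3f\n", n, serialized, overlapped);
    }
    return 0;
}
```

Either way, once N is large the fixed T_IO and T_comm terms dominate, which is the theme of the rest of the talk.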

SLIDE 3

Why Multicore?

We don’t get faster cores as often as we get more of them

Source: SPEC2000 FP Reports http://www.spec.org/cpu/results/cpu2000.html

SLIDE 4

Compute Node Design: A Memory Game

T_arith is becoming irrelevant (because N is getting large). The design of the compute node is all about maximizing usable bandwidth between the compute elements and a large block of memory:

• Multicore
• GPGPU
• Hybrid scalar/vector (e.g. Cell)

The architecture choice drives the programming model, but all are otherwise interchangeable.

SLIDE 5

FFT Kernel

If Arithmetic is free, but pins are limited...

SLIDE 6

Stencil (Convolution) Kernel

0.4 bytes/FLOP? Then the processor spends half its time waiting.
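Where "half its time waiting" can come from, as a back-of-envelope (the 1 GF/s rate matches the per-CPU figure of the node chip described later; the 0.8 bytes/FLOP demand is an assumed stencil requirement, not a number from the slide):

```c
/* Roofline-style stall estimate: compare the time to issue the flops
 * against the time to move the bytes, assuming the two overlap. */
#include <stdio.h>

int main(void) {
    double peak_gflops    = 1.0;  /* per-core peak, GF/s */
    double mem_gbps       = 0.4;  /* delivered: 0.4 bytes per flop of peak */
    double bytes_per_flop = 0.8;  /* what the kernel demands (assumed) */

    double t_arith = 1.0 / peak_gflops;          /* seconds per Gflop */
    double t_mem   = bytes_per_flop / mem_gbps;  /* seconds to feed it */
    double t_total = t_mem > t_arith ? t_mem : t_arith;
    printf("stall fraction = %.2f\n", 1.0 - t_arith / t_total);  /* 0.50 */
    return 0;
}
```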

SLIDE 7

Alternatives

• Many fast cores on one die: requires commensurate memory ports, hence a high pin count; high pin count plus high processor count means a large die.
• A few fast cores on one die: better T_arith : T_mem balance, smaller die.
• A few moderate cores on one die: balances T_arith : T_mem : T_comm and leaves pins to spend on other features.

SLIDE 8

Cubic Domain Decomposition

Simple 7×7×7 “Jax” stencil operator over a large volume (1K³, single precision): 19 flops per point.
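One plausible reading of that kernel (an assumption; the slide shows no code): “7×7×7” meaning 7 points along each axis gives a 19-point star stencil, i.e. 19 multiply-adds per point. A minimal sketch:

```c
/* 19-point star stencil: center plus 3 neighbors each way along each
 * axis. c[] holds the center and distance-1/2/3 coefficients (assumed
 * symmetric). Boundary/halo handling is omitted. */
#include <stdio.h>
#include <stdlib.h>

void stencil_pass(int n, const float *in, float *out, const float c[4])
{
    size_t s1 = 1, s2 = n, s3 = (size_t)n * n;  /* strides per axis */
    for (int z = 3; z < n - 3; z++)
        for (int y = 3; y < n - 3; y++)
            for (int x = 3; x < n - 3; x++) {
                size_t i = x * s1 + y * s2 + z * s3;
                float acc = c[0] * in[i];
                for (int d = 1; d <= 3; d++)  /* 6 points per distance */
                    acc += c[d] * (in[i - d * s1] + in[i + d * s1]
                                 + in[i - d * s2] + in[i + d * s2]
                                 + in[i - d * s3] + in[i + d * s3]);
                out[i] = acc;
            }
}

int main(void) {
    int n = 64;  /* 1K^3 at full scale; small here */
    float *in  = calloc((size_t)n * n * n, sizeof *in);
    float *out = calloc((size_t)n * n * n, sizeof *out);
    const float c[4] = {0.4f, 0.2f, 0.15f, 0.05f};
    in[(32 * (size_t)n + 32) * n + 32] = 1.0f;  /* point source */
    stencil_pass(n, in, out, c);
    printf("center after one pass: %g\n", out[(32 * (size_t)n + 32) * n + 32]);
    free(in); free(out);
    return 0;
}
```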

SLIDE 9

Cubic Domain Decomposition

Set a goal of completing a pass in 1 ms. Faster processors complete larger chunks of the total volume.

SLIDE 10

Cubic Domain Decomposition

Factor in T_comm and we find that a 200 MB/s per-node link forces a chunk size of 50³.
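One way to recover the 50³ figure (assumptions, not from the slide: a halo 3 points deep to match the 7×7×7 stencil, 4-byte single-precision points, six faces exchanged per pass, no overlap of communication with compute):

```c
/* Halo-exchange time for a cubic chunk of side s: six faces of
 * s*s*h points, h = halo depth, over a 200 MB/s node link. */
#include <stdio.h>

int main(void) {
    const double link_bw = 200e6;  /* bytes/s per node link */
    const int h = 3;               /* halo depth for a 7x7x7 stencil */
    for (int s = 30; s <= 80; s += 10) {
        double bytes  = 6.0 * s * s * h * 4;  /* six faces, 4 B/point */
        double t_comm = bytes / link_bw;
        printf("s=%3d  halo=%8.0f B  t_comm=%5.2f ms\n",
               s, bytes, t_comm * 1e3);
    }
    return 0;
}
```

At s = 50 this gives 180 KB and about 0.9 ms, which just fits the 1 ms budget; bigger chunks blow it.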

SLIDE 11

Cubic Domain Decomposition

If the goal is “time per step,” computation speed may not matter; GPUs, FPGAs, and magic dust don’t help.

SLIDE 12

The Systems

A Family: From Production Systems to Personal Development Workstations

SLIDE 13

The SC5832

• 5832 processors
• 7.7 TB memory
• > 200 FC I/O channels
• Single Linux system
• 16 kW
• Cool and reliable

SLIDE 14

SiCortex in the Technical Computing Ecosystem

Affordable, easy to install, easy to maintain.

• Development platforms for high-processor-count applications
• Rich cluster/MPI development environment
• Systems from 72 to 5832 processors
• Production platforms in target application areas: multidimensional FFT, large matrices, sorting/searching

SLIDE 15

The SiCortex Node Chip

Six-way Linux SMP with two DDR-2 ports, PCI Express, message (DMA) controller, and fabric switch.

[Block diagram: six CPUs, each with L1 I/D and L2 caches, joined by a processor/memory switch to two DDR-2 controllers, the DMA engine, the fabric switch and SERDES units, and the PCI Express controller]

• Six 64-bit MIPS CPUs: 500 MHz, 1 GF/s double precision
• 32+32 KB L1 cache, 256 KB L2 cache, ECC
• Two DDR-2 controllers, 2 × 4 GB DDR-2 DIMMs
• Fabric links to/from other nodes, 1.6 GB/s per link
• External I/O: 8-lane PCI Express
• 1152-pin BGA, 170 sq mm, 90 nm

SLIDE 16


The SiCortex Module

[Module photo, callouts: memory, PCI Express I/O, fabric interconnect, everything else]

• Compute: 162 GF/s
• Memory bandwidth: 345 GB/s
• Fabric bandwidth: 78 GB/s
• I/O bandwidth: 7.5 GB/s
• Power: 500 W

SLIDE 17


The SiCortex System

• 36 modules with midplane interconnect
• I/O
• Cable management
• Fan tray
• Power distribution and system service processor

SLIDE 18

The Kautz Graph

• Logarithmic diameter
• Reconfigures around failures
• Low contention
• Very fast collectives
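The fabric is a degree-3 Kautz graph; the 36-node example on the next slide is K(3,2), whose vertices are the length-3 strings over {0,1,2,3} with no two adjacent symbols equal, each pointing at the three strings reachable by shifting in a new final symbol. A small enumeration sketch (the numeric IDs are just enumeration order, not SiCortex's node numbering):

```c
/* Enumerate the 36-node, degree-3 Kautz graph K(3,2). Vertex count is
 * (d+1)*d^n = 4*3*3 = 36; diameter grows only logarithmically in the
 * node count, which is the property the slide is citing. */
#include <stdio.h>

int main(void) {
    int id = 0;
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            for (int c = 0; c < 4; c++) {
                if (a == b || b == c) continue;  /* not a Kautz string */
                printf("node %2d (%d%d%d) ->", id++, a, b, c);
                for (int t = 0; t < 4; t++)      /* shift in new symbol */
                    if (t != c) printf(" %d%d%d", b, c, t);
                printf("\n");
            }
    printf("%d nodes, 3 links out of each\n", id);
    return 0;
}
```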

SLIDE 19

Thirty-six node Kautz graph

A pattern is developing

[Figure: the 36 numbered nodes arranged in a ring with their Kautz links]

SLIDE 20


Integrated HPC Linux Environment

Operating System

• Linux kernel and utilities (2.6.18+)
• Cluster file system (Lustre)

Development Environment

• GNU C, C++
• PathScale C, C++, Fortran
• Math libraries
• Performance tools
• Debugger (TotalView)
• MPI libraries (MPICH2)

System Management

• Scheduler (SLURM)
• Partitioning
• Monitoring
• Console, boot, diagnostics

Maintenance and Support

• Factory-installed software
• Regular updates
• Open-source build environment


SLIDE 21

Tuning Tools

• Serial code: hpcex
• Communication: mpiex
• I/O: ioex
• System: oprofile
• Hardware counters: papiex
• Visualization: TAU, Vampir

SLIDE 22

Parallel File System

Lustre parallel file system:

• Open source
• POSIX compliant
• Native implementation uses DMA engine primitives
• Scales up to hundreds of I/O nodes

SLIDE 23

FabriCache

RAM-backed file system:

• Based on the Lustre file system
• Stores all data in Object Storage Server (OSS) RAM
• Presents the data as a file system
• Scales to 972 OSS nodes

Similar to an SSD, but:

• Higher bandwidth / lower latency
• No external hardware required
• Creating/removing volumes is easier

Useful for:

• Intermediate results
• Shared pools of data
• Staging data to/from disk

SLIDE 24

MicroBenchmarks and Kernels

• MPI latency: 1.4 µs
• MPI bandwidth: 1.5 GB/s
• HPC Challenge work underway
• SC5832, on 5772 CPUs:
  – DGEMM: 72% of peak
  – HPL: 3.6 TF (83% of DGEMM)
  – PTRANS: 210 GB/s
  – STREAM: 345 MB/s per CPU (1.9 TB/s aggregate)
  – FFT: 174 GF
  – RandomRing: 4 µs, 50 MB/s
  – RandomAccess: 0.74 GUPS (5.5 optimized)
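Numbers like the 1.4 µs latency and 1.5 GB/s bandwidth above are conventionally measured with an MPI ping-pong between two ranks; a minimal version (not SiCortex's benchmark code) looks like this:

```c
/* MPI ping-pong: rank 0 sends to rank 1 and waits for the echo.
 * Half the round-trip time is the one-way latency; bytes divided by
 * that time is the bandwidth. Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1; bytes <= 1 << 20; bytes *= 4) {
        char *buf = calloc(bytes, 1);
        int iters = 1000;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * iters);  /* one-way */
        if (rank == 0)
            printf("%8d B  %8.2f us  %7.3f GB/s\n",
                   bytes, dt * 1e6, bytes / dt / 1e9);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```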

SLIDE 25

Zero contention message bandwidth?

Interesting relationship between message size and bandwidth

SLIDE 26

Communication in “real world” conditions

Contention matters. (For more, see Abhinav Bhatele’s work at http://charm.cs.uiuc.edu/.)

SLIDE 27

What about Collectives?

Dependence on vector size is predictable.

SLIDE 28

What can it do?

The machine shines on problems that require lots of communication between processes:

• TeraByte Sort
• Three-dimensional FFT
• Huge systems of equations

SLIDE 29

TeraByte Sort

• Sort 10 billion 100-byte records (10-byte key)
• Leave out the I/O (so this isn't quite the Indy TeraSort benchmark)
• Use 5600 processors
• Key T_comm attributes:
  – Time to exchange all 1 TB is about 4 s
  – Time to copy each processor's sublist is about 1 s
  – A global AllReduce of a 256 KB vector is O(10 ms)
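The deck doesn't show the algorithm, but the T_comm items above match the classic sample-sort shape: sort locally, bucket records by key range, exchange all buckets at once, then finish locally. A toy integer-key skeleton of the exchange step (uniform keys assumed, so splitters can be computed instead of sampled; real TeraSort records are 100 bytes, not 8):

```c
/* Sample-sort skeleton: the big all-at-once exchange is one
 * MPI_Alltoallv, which at full scale is the ~4 s "exchange all 1 TB"
 * step above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int n = 1 << 16;                       /* records per rank */
    long *keys = malloc(n * sizeof *keys);
    srand(rank + 1);
    for (int i = 0; i < n; i++) keys[i] = rand();
    qsort(keys, n, sizeof *keys, cmp);     /* local sort first */

    /* Uniform keys assumed: rank p owns key range [p, p+1)*RAND_MAX/np,
     * so bucket assignment is a division rather than a search. */
    int *scnt = calloc(np, sizeof *scnt);
    for (int i = 0; i < n; i++) {
        int dest = (int)((double)keys[i] * np / ((double)RAND_MAX + 1));
        scnt[dest]++;
    }
    int *rcnt = malloc(np * sizeof *rcnt);
    MPI_Alltoall(scnt, 1, MPI_INT, rcnt, 1, MPI_INT, MPI_COMM_WORLD);

    int *sdsp = malloc(np * sizeof *sdsp), *rdsp = malloc(np * sizeof *rdsp);
    sdsp[0] = rdsp[0] = 0;
    for (int p = 1; p < np; p++) {
        sdsp[p] = sdsp[p-1] + scnt[p-1];
        rdsp[p] = rdsp[p-1] + rcnt[p-1];
    }
    int m = rdsp[np-1] + rcnt[np-1];
    long *mine = malloc(m * sizeof *mine);

    MPI_Alltoallv(keys, scnt, sdsp, MPI_LONG,
                  mine, rcnt, rdsp, MPI_LONG, MPI_COMM_WORLD);
    qsort(mine, m, sizeof *mine, cmp);     /* a merge would do; qsort is simple */
    printf("rank %d owns %d keys\n", rank, m);
    MPI_Finalize();
    return 0;
}
```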

SLIDE 30

Tuning....

• Improved QSort to the model target
• Bucket assignment is still very slow
• Exchange is still a little slow
• We can do better...

SLIDE 31

Three-Dimensional FFT

3D FFT of a 1-billion-point volume (1040 × 1040 × 1040, complex-to-complex, single precision), using PFAFFT (prime factor algorithm). Two target platforms:

• SC072: 72 processors
• SC1458: 1458 processors
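The "3D Transpose" phase timed on the next slide is the communication core of this kind of FFT: with a slab decomposition, redistributing the data between the axis-local FFT passes is an MPI_Alltoall of square blocks plus a local transpose of each block. A sketch of just that phase for one N × N plane (the standard pattern, not SiCortex's implementation; N must be divisible by the rank count, which may be why 65 and 1040 processors appear in the results rather than all 72 or 1458):

```c
/* Distributed transpose of an N x N single-precision complex matrix.
 * Each rank owns N/np consecutive rows; after the exchange it owns
 * N/np consecutive columns, stored as rows of the transpose. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int N = 1040;                  /* matches the 1040^3 volume */
    int rows = N / np;             /* np must divide N */
    float complex *a   = malloc((size_t)rows * N * sizeof *a);
    float complex *buf = malloc((size_t)rows * N * sizeof *buf);
    for (int i = 0; i < rows * N; i++) a[i] = rank + i * I;

    /* Pack: the rows x rows block bound for rank p goes contiguous. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                buf[((size_t)p * rows + i) * rows + j] =
                    a[(size_t)i * N + p * rows + j];

    MPI_Alltoall(buf, rows * rows * 2, MPI_FLOAT,
                 a,   rows * rows * 2, MPI_FLOAT, MPI_COMM_WORLD);

    /* Unpack: transpose each received block into its column slot. */
    for (int p = 0; p < np; p++)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < rows; j++)
                buf[(size_t)j * N + p * rows + i] =
                    a[((size_t)p * rows + i) * rows + j];

    if (rank == 0) printf("transposed %d x %d across %d ranks\n", N, N, np);
    MPI_Finalize();
    return 0;
}
```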

SLIDE 32

Results!

FFTW3 is now producing comparable results.

Phase times, 65-processor vs. 1040-processor 3D FFT:

                 65 processors   1040 processors
  1D FFT              2.91            0.18
  3D Transpose        1.96            0.25
  2D FFT              6.37            0.40

SLIDE 33

Product Directions

Revere the model: T_sol = T_arith/N + T_mem/N + T_IO + f(N)·T_comm

• First generation emphasized T_comm and T_IO
• Second generation aimed at T_mem and T_arith while taking advantage of technology improvements for T_comm and T_IO:
  – More performance per watt / cubic foot / dollar
  – Richer I/O infrastructure
  – “Special purpose” configurations

SLIDE 34

Take-away

SiCortex builds Linux clusters with purpose-built components, optimized for high-communication applications.

High Processor Count Computing.
