2010 Blue Waters Performance Modeling Workshop – Opening and Introduction (PowerPoint presentation)

SLIDE 1

2010 Blue Waters Performance Modeling Workshop – Opening and Introduction

Torsten Hoefler


With slides from: William Kramer, Marc Snir, William Gropp, IBM, and the Blue Waters team

SLIDE 2

Introduction and Overview

  • My slides contain only public information and will be available online after the workshop
  • No need to take pictures or notes!
  • Parts of tomorrow will contain IBM-confidential information
  • You may only attend the NDA session if your institution has signed and cleared all NDAs for you!
  • You are responsible for maintaining the confidentiality of the information!

SLIDE 3

Blue Waters in a Nutshell

  • >300,000 compute cores
  • Based on Power7
  • 10 PF/s peak
  • 1 PF/s sustained
  • >1 PiB RAM
  • >10 PiB disk storage
  • >0.5 EiB archival storage
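The headline numbers above are internally consistent, which a few lines of arithmetic can show. This is only a sanity check: the exact core count and clock were not public at the time, so the inputs below are round figures taken from these slides.

```python
# Sanity-check the slide's headline numbers (illustrative round figures only).
cores = 300_000            # ">300,000 compute cores"
flops_per_cycle = 8        # 4 FMAs/cycle/core (see the Power7 slide)
clock_hz = 4.0e9           # upper end of the 3.5-4 GHz range

peak_flops = cores * flops_per_cycle * clock_hz
print(f"peak: {peak_flops / 1e15:.1f} PF/s")   # close to the quoted 10 PF/s

# 1 PF/s sustained against that peak:
sustained_fraction = 1e15 / peak_flops
print(f"sustained fraction: {sustained_fraction:.0%}")
```

So the 1 PF/s sustained target corresponds to roughly a tenth of peak, a plausible sustained fraction for real applications.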

SLIDE 4

Performance Modeling for Blue Waters

  • Most users have experience only at comparatively “small” scale (<8,000 cores)
  • Applications should be ready to run on the full system
  • This requires a clear understanding before the system is deployed (the run, tweak, rerun loop is not possible)
  • Programmers need to develop a deep understanding of their application’s scaling and bottlenecks at scale through performance modeling!
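A minimal sketch of what such an analytic model can look like: fit a few terms at small scale, then evaluate the model at full-system scale instead of rerunning. The functional form and every coefficient below are hypothetical, chosen only to illustrate the idea, not taken from any Blue Waters application.

```python
import math

def model_time(p, work=1.0e4, t_serial=2.0, t_msg=1.0e-3):
    """Hypothetical model: T(p) = serial part + perfectly parallel part
    + tree-based (log p) communication. All coefficients are made up."""
    return t_serial + work / p + t_msg * math.ceil(math.log2(p))

small = model_time(8_000)     # the scale most users have experience with
full  = model_time(300_000)   # full-system scale
print(f"T(8k) = {small:.3f} s, T(300k) = {full:.3f} s")
```

Even this toy model exposes the bottleneck without a single full-scale run: between 8,000 and 300,000 processes the parallel term shrinks to nearly nothing and the serial term dominates, which is exactly the kind of insight the workshop aims for.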

SLIDE 5

From Chip to Entire Integrated System

[Diagram: system hierarchy]
  • P7 chip (8 cores)
  • SMP node (32 cores)
  • Drawer (256 cores)
  • SuperNode (1,024 cores; 32 nodes / 4 CECs, connected by L-Link cables)
  • Building block
  • Blue Waters system at NPCF, with on-line and near-line storage

SLIDE 6

slide-7
SLIDE 7

Power7 Chip (8 cores)

  • Base Technology
  • 45 nm, 576 mm²
  • 1.2 B transistors
  • Chip
  • 8 cores
  • 4 FMAs/cycle/core
  • 32 MB L3 (private/shared)
  • Dual DDR3 memory
  • 128 GiB/s peak bandwidth (1/2 byte/flop)
  • Clock range of 3.5 – 4 GHz
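The per-chip peak and the byte/flop ratio follow directly from these bullets. A quick derivation, assuming the 4 GHz upper end of the clock range:

```python
# Derive the Power7 per-chip peak and bytes/flop from the slide's figures.
cores = 8
fma_per_cycle = 4          # 4 FMAs/cycle/core -> 8 flops/cycle/core
clock_hz = 4.0e9           # top of the 3.5-4 GHz range

peak_gflops = cores * fma_per_cycle * 2 * clock_hz / 1e9
print(f"chip peak: {peak_gflops:.0f} GF/s")    # 256 GF/s at 4 GHz

mem_bw_gib = 128           # dual DDR3, peak
ratio = mem_bw_gib * 2**30 / (peak_gflops * 1e9)
print(f"bytes/flop: {ratio:.2f}")  # roughly the quoted 1/2 byte/flop
```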


Quad-chip MCM

SLIDE 8

L3 Cache/On-Chip Communication


  • L1: 32 KB instruction / core
  • L1: 32 KB data / core
  • L2: 256 KB / core
  • L3: 4 MB eDRAM / core
  • Fast private and shared regions

SLIDE 9

[Diagram: quad-chip module: four P7 chips (P7-0 to P7-3), each with two memory controllers (MC0/MC1) and four clock groups (A to D), fully connected by the W/X/Y/Z SMP buses]

Quad Chip Module (4 chips)

  • 32 cores!
  • 32 cores × 8 flops/core/cycle × 4 GHz = 1 TF
  • 4 threads per core (max)
  • 4 × 32 MiB L3 cache
  • 512 GB/s RAM BW (0.5 B/F)
  • 800 W (0.8 W/F)
  • Flat shared memory!
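The ratios on this slide can be checked in a couple of lines, again assuming the 4 GHz upper clock:

```python
# Quick check of the quad-chip-module figures quoted on this slide.
cores = 32
peak_gf = cores * 8 * 4          # 8 flops/cycle/core at 4 GHz -> GF/s
print(f"peak: {peak_gf} GF/s")   # 1024 GF/s, i.e. ~1 TF as quoted

print(f"B/F: {512 / peak_gf}")       # 512 GB/s RAM BW -> 0.5 B/F
print(f"W/F: {800 / peak_gf:.2f}")   # 800 W -> ~0.78 W/F (slide rounds to 0.8)
```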

SLIDE 10

Adding a Network Interface (Torrent)


[Diagram: hub chip module attached to the QCM. Four P7 chips with 16 DIMMs; seven inter-hub board-level LL buses (22+22 GB/s each); 24 LR optical links (5+5 GB/s per 6x link, 240 GB/s total); 16 D optical links (10+10 GB/s per 12x link, 320 GB/s total); PCIe ports (two 16x, one 8x).]

  • Connects the QCM to PCIe (two 16x and one 8x PCIe slot)
  • Connects 8 QCMs via a low-latency, high-bandwidth copper fabric
  • Provides a message-passing mechanism with very high bandwidth
  • Provides the lowest possible latency between the 8 QCMs

SLIDE 11

[Diagram: the Torrent hub chip. Copper LL buses (LL0 to LL6, 8B wide) to the other hubs on the board; 24 optical LR buses to the other three drawers of a supernode; 16 optical D buses interconnecting supernodes; three PCIe buses (16x, 16x, 8x) with hot-plug control; service interfaces (FSI, I2C, SEEPROM); 1.1 TB/s aggregate hub bandwidth.]

  • 192 GB/s Host Connection
  • 336 GB/s to 7 other local nodes
  • 240 GB/s to local-remote nodes
  • 320 GB/s to remote nodes
  • 40 GB/s to general purpose I/O
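The per-destination bandwidths listed above should add up to the 1.1 TB/s aggregate quoted for the hub. A quick check, using the values exactly as given on the slide:

```python
# Sum the hub's per-destination bandwidths (GB/s, as quoted on the slide).
links = {
    "host (QCM)":          192,
    "7 other local nodes": 336,
    "local-remote nodes":  240,
    "remote nodes (D)":    320,
    "general purpose I/O":  40,
}
total = sum(links.values())
print(f"{total} GB/s")   # 1128 GB/s, i.e. the ~1.1 TB/s in the diagram
```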
SLIDE 12


First Level Interconnect

  • L-Local
  • HUB to HUB Copper Wiring
  • 256 Cores

[Diagram: one drawer. Eight QCMs (QCM 0 to 7, four P7 chips each), eight hub modules with optical fan-out (2,304-fiber L-Links, 64/40 optical D-Links), 17 PCIe slots, and 16 DIMMs per node.]
Drawer

  • 8 nodes
  • 32 chips
  • 256 cores
SLIDE 13

SLIDE 14

2nd Level of Interconnect (1,024 cores)

4.6 TB/s bisection bandwidth (the BW of 1,150 10G-E ports)

[Diagram: the supernode's four drawers, each with eight QCMs (QCM 0 to 7) and eight 61×96 mm hub modules, optical fan-out from the hub modules via 2,304-fiber L-Links and 64/40 optical D-Links, and FSP/CLK-A and FSP/CLK-B service processors.]

Second Level Interconnect

  • Optical ‘L-Remote’ Links from HUB
  • 4 drawers
  • 1,024 Cores

[Figure: supernode (32 nodes / 4 CECs) cabled with L-Links]

SLIDE 15

Global Interconnection Network


  • This space is intentionally left blank
  • More details in the NDA sessions

A photo of the RAM for distraction.

SLIDE 16

National Petascale Computing Facility


A facility dedicated to Blue Waters

SLIDE 17

Back to Performance Modeling


  • Main goals of this workshop:
  • Ignite performance modeling efforts within all PRAC teams, in collaboration with NCSA
  • Start to gather a deep understanding of the performance characteristics of all codes

SLIDE 18

Logistics


  • Today:
  • A team from LANL will present a tutorial about performance modeling
  • Specific examples and use cases
  • Tomorrow:
  • Hands-on sessions to get modeling of applications started
  • Supported by the LANL and NCSA teams
  • Try to work with your PoC
SLIDE 19

NDA issues


  • Not all participants are covered by all necessary NDAs
  • Badges will be marked
  • Please be careful what you talk about
  • You are responsible for the information
  • Everything in my slides can be communicated freely!

SLIDE 20

I’m here to help!


  • We have 15 training accounts on a Power 5 system available for tomorrow
  • It’s AIX
  • Ask me if you need one
  • Let me know if you have questions, problems, or comments!