High Performance Hardware, Memory & CPU - PowerPoint PPT Presentation



SLIDE 1

Introductory Computational Science, Rubin Landau, EPIC/OSU 2006

High Performance Hardware, Memory & CPU

Rubin H. Landau

With Sally Haerer and Scott Clark

Computational Physics for Undergraduates

BS Degree Program: Oregon State University

“Engaging People in Cyber Infrastructure.” Supported by EPICS/NSF & OSU.

[Figure: memory hierarchy. CPU ↔ cache (32 KB at 6 GB/s; 2 MB) ↔ RAM (2 GB) ↔ main store (32 TB at 111 Mb/s)]

Step back and look inside; many don’t. Insight, not numbers! Face the real world.

SLIDE 2


Problem: Optimize for Speedup

  • Faster by smarter (algorithms), not bigger
  • Yet at the limit: tune the program to the architecture
  • First locate hot spots, then speed them up
  • Negative side:
  • hard work & (your) time intensive
  • tuned to local hard/software: less portable, less readable
  • CS: “the compiler’s job, not yours”
  • CSE: large, complex, frequently run programs: 3-5X speedup
  • “CSE: tomorrow’s problems, yesterday’s hardware & CS” (Press)

SLIDE 3


Theory: Rules of Optimization

1. “More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity.” - W.A. Wulf

  • 2. “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” - D. Knuth
  • 3. “The best is the enemy of the good.” - Voltaire
  • 4. Do not do it.
  • 5. (for experts only): “Do not do it yet.” - M.A. Jackson
  • 6. “Do not optimize as you go.”
  • 7. Remember the 80/20 rule: 80% of the results from 20% of the effort (also 90/10).

  • 8. Always run “before” and “after” benchmarks
  • fast wrong answers are not compatible with the search for truth (or bridges)
  • 9. Use the right algorithms and data structures!

Jonathan Hardwick, www.cs.cmu.edu/~jch
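Rule 8 in practice: a minimal before/after benchmark with `timeit`, using a made-up example (a hand-rolled loop vs the built-in `sum`); per rule 8's caveat, check correctness before comparing speed:

```python
import timeit

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

def fast_sum(n):
    # Same result via the C-implemented builtin.
    return sum(range(n))

# Correctness first: fast wrong answers are worse than slow right ones.
assert slow_sum(10_000) == fast_sum(10_000)

before = timeit.timeit(lambda: slow_sum(10_000), number=200)
after = timeit.timeit(lambda: fast_sum(10_000), number=200)
print(f"before: {before:.4f} s, after: {after:.4f} s")
```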

SLIDE 4


Theory: HPC Components

  • Supercomputers = fastest, most powerful
  • Now: parallel machines, PC/workstation based
  • Linux/Unix ($$ if MS)
  • HPC = good balance of major components:
  • multistaged (pipelined) units
  • multiple CPUs (parallel)
  • fast CPU, but compatible
  • very large, very fast memories
  • very fast communications
  • vector, array processors (?)
  • software: integrates all

Chip (2 processors): 2.8/5.6 Gflops, 4 MB
Card (2 chips): 11.2 Gflops, 1 GB DDR
Board (16 cards): 80 Gflops, 16 GB DDR
Cabinet (32 boards): 5.7 Tflops, 512 GB
System (64 cabinets): 360 Tflops, 32 TB

SLIDE 5


Memory Hierarchy vs Arrays

[Figure: memory hierarchy for array storage. Array elements A(1)...A(N) and matrix elements M(1,1)...M(N,N) pass from swap space through RAM pages (Page 1: A(1),...,A(16); ...; Page N: A(2032),...,A(2048)) to the data cache and CPU registers.]

Row major: C, Java; Column major: F90

  • C, J: m(0,0) m(0,1) m(0,2) m(1,0) m(1,1) m(1,2) m(2,0) m(2,1) m(2,2)
  • F: m(1,1) m(2,1) m(3,1) m(1,2) m(2,2) m(3,2) m(1,3) m(2,3) m(3,3)

Ideal world: contiguous array storage. Real world: matrices ≠ contiguous blocks (broken lines).
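The row-major vs column-major orderings above can be checked with offset arithmetic (a sketch; 0-based indices as in C/Java):

```python
NROWS, NCOLS = 3, 3

def offset_row_major(i, j):
    # C/Java: the rightmost index varies fastest.
    return i * NCOLS + j

def offset_col_major(i, j):
    # Fortran: the leftmost index varies fastest.
    return i + j * NROWS

# Visiting elements in storage order keeps memory access sequential.
row_order = [(i, j) for i in range(NROWS) for j in range(NCOLS)]
col_order = [(i, j) for j in range(NCOLS) for i in range(NROWS)]

assert [offset_row_major(i, j) for i, j in row_order] == list(range(9))
assert [offset_col_major(i, j) for i, j in col_order] == list(range(9))
print("row-major order:", row_order[:4])  # (0,0) (0,1) (0,2) (1,0), as on the slide
```

Traversing a large matrix against its storage order touches a new cache line (or page) on nearly every access, which is the "broken lines" cost the slide points at.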

SLIDE 6


Memory Hierarchy: Cost vs Speed

  • CPU: registers, instructions, FPA, 8 GB/s
  • Cache: high-speed buffer, 5.5 GB/s
  • Cache lines: latency issues
  • RAM: random access memory
  • Via RISC: reduced instruction set computer
  • Hard disk: cheap and slow, 111 Mb/s
  • Pages: length = 4 KB (386), 8-16 KB (Unix)
  • Virtual memory: larger than RAM (32-bit address space ≈ 4 GB)

little effort, but a cost in time: page faults, e.g., multitasking/windows
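The 32-bit ≈ 4 GB figure and the page counts follow from quick arithmetic (a sketch using the slide's 4 KB page size):

```python
# A 32-bit address reaches 2**32 distinct bytes.
address_space = 2**32
print(address_space // 2**30, "GB")  # 4 GB

# With 4 KB pages (the 386 figure above), the address space holds:
page_size = 4 * 1024
num_pages = address_space // page_size
print(num_pages, "pages")  # 1048576 pages (2**20)
```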

[Figure: memory hierarchy. CPU ↔ cache (32 KB at 6 GB/s; 2 MB) ↔ RAM (2 GB) ↔ main store (32 TB at 111 Mb/s); levels labeled A-D]

SLIDE 7


High Performance Hardware, Memory & CPU (Part II)

Rubin H. Landau

With Sally Haerer and Scott Clark

Computational Physics for Undergraduates

BS Degree Program: Oregon State University

“Engaging People in Cyber Infrastructure.” Supported by EPICS/NSF & OSU.

[Figure: memory hierarchy. CPU ↔ cache (32 KB at 6 GB/s; 2 MB) ↔ RAM (2 GB) ↔ main store (32 TB at 111 Mb/s)]

(examples)

SLIDE 8


Central Processing Unit

[Figure: memory hierarchy (repeated): CPU registers, data cache, RAM pages, swap space, with array elements A(1)...A(N) and matrix elements M(1,1)...M(N,N) mapped across it]

  • Pipelines: speed
  • Prepare next step during previous
  • Bucket brigade

e.g., c = (a + b) / (d * f)

Unit | Step 1  | Step 2  | Step 3  | Step 4
A1   | Fetch a | Fetch b | Add     |
A2   |         | Fetch d | Fetch f | Multiply
A3   |         |         |         | Divide

  • Interacting Memories
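The bucket-brigade payoff can be counted: a k-stage pipeline finishes N independent operations in N + k - 1 cycles rather than N × k (a toy model that ignores stalls and dependencies):

```python
def cycles_unpipelined(n_ops, n_stages):
    # Each operation holds the whole unit for all of its stages.
    return n_ops * n_stages

def cycles_pipelined(n_ops, n_stages):
    # Bucket brigade: once the pipe fills, one result emerges per cycle.
    return n_ops + n_stages - 1

print(cycles_unpipelined(100, 4))  # 400
print(cycles_pipelined(100, 4))    # 103
```

For long streams of independent operations the speedup approaches the number of stages, which is why pipelined units reward regular, vectorizable loops.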
SLIDE 9


CPU Design: RISC

  • RISC = Reduced Instruction Set Computer (HPC)
  • CISC = Complex Instruction Set Computer (previous)
  • high-level microcode on chip (1000s of instructions)
  • complex instructions → slow (many cycles/instruction)
  • RISC: smaller (simpler) instruction set on chip
  • F90, C compilers translate for the RISC architecture
  • simpler (fewer cycles/instruction), cheaper, possibly faster
  • saved instruction space → more CPU registers
  • more pipelines, fewer memory conflicts, some parallelism
  • Theory:

CPU time = (# instructions) × (cycles/instruction) × (cycle time)

CISC: fewer instructions executed. RISC: fewer cycles per instruction.
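Plugging illustrative (made-up) numbers into the CPU-time formula shows how the trade-off can go either way:

```python
def cpu_time_ns(n_instructions, cycles_per_instruction, cycle_time_ns):
    # CPU time = (# instructions) x (cycles/instruction) x (cycle time)
    return n_instructions * cycles_per_instruction * cycle_time_ns

# Hypothetical workloads: CISC runs fewer instructions,
# RISC spends fewer cycles on each one.
cisc = cpu_time_ns(1.0e9, 10, 1.0)   # 1e10 ns = 10 s
risc = cpu_time_ns(1.5e9, 1.5, 1.0)  # 2.25e9 ns = 2.25 s
print(cisc / 1e9, "s vs", risc / 1e9, "s")
```

Here RISC wins even though it executes 50% more instructions, because its cycles-per-instruction figure is so much lower.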

[Figure: memory hierarchy (repeated): CPU registers, data cache, RAM pages, swap space, with array elements A(1)...A(N) and matrix elements M(1,1)...M(N,N) mapped across it]

SLIDE 10


Latest & Greatest: IBM Blue Gene

  • A. Gara et al., IBM J.
  • Specific genes → general SC
  • Linux ($$ if MS)
  • By committee
  • Balance → cost/performance, performance/watt
  • On-chip and off-chip distributed memories
  • 2 cores: 1 computes, 1 communicates
  • Extreme scale: 65,536 (2^16) nodes
  • Peak = 360 teraflops (10^12 flops)
  • Medium speed, 5.6 Gflops (runs cool)
  • 2 chips/card, 16 cards/board, 32 boards/cabinet
  • Control: distributed memory MPI

Chip (2 processors): 2.8/5.6 Gflops, 4 MB
Card (2 chips): 11.2 Gflops, 1 GB DDR
Board (16 cards): 80 Gflops, 16 GB DDR
Cabinet (32 boards): 5.7 Tflops, 512 GB
System (64 cabinets): 360 Tflops, 32 TB
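The headline figures can be cross-checked from the per-node numbers (a rough sketch; the raw product is 367 Tflops, quoted as ≈360 Tflops peak on the slide):

```python
nodes = 2**16            # 65,536 compute nodes
gflops_per_node = 5.6    # dual-core chip, peak
mb_per_node = 512        # from the compute-heart slide

peak_tflops = nodes * gflops_per_node / 1000
total_tb = nodes * mb_per_node / 2**20  # MB -> TB (binary)

print(f"{peak_tflops:.0f} Tflops peak, {total_tb:.0f} TB memory")
```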

(double data rate)

SLIDE 11


BG's 3 Communication Networks

  • Fig (a): 64 × 32 × 32 3-D torus (2 × 2 × 2 shown); links = chips that also compute; both nearest-neighbor & cut-through routing ≈ effective bandwidth to all nodes; node-to-node: 1.4 Gb/s
  • Program speed: local communication; 100 ns < latency < 6.4 µs (64 hops)

  • Fig (b): Global collective network; broadcast to all processors
  • Fig (c): Control network + Gb-Ethernet for I/O, switch, devices; > Tb/s
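On a torus, the wraparound links halve the worst-case distance in each dimension, so the 64-hop bound above can be checked directly (a sketch, assuming ~100 ns per hop as on the slide):

```python
def max_torus_hops(dims):
    # Worst-case nearest-neighbor distance: half of each dimension,
    # since a message can wrap around the shorter way.
    return sum(d // 2 for d in dims)

hops = max_torus_hops((64, 32, 32))
latency_us = hops * 100 / 1000  # 100 ns per hop -> microseconds
print(hops, "hops,", latency_us, "microseconds")  # 64 hops, 6.4 microseconds
```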

SLIDE 12


Blue Gene Compute Heart

  • 2 PowerPC 440s, 2 FPUs
  • 1 compute, 1 I/O (Ethernet)
  • RISC 7-stage CPU, 3 pipelines
  • Memory: 512 MB/node → 32 TB total
  • Variable page size
  • Three cache levels

L1 cache: 32 KB (32 B line width)
L2 cache: 2 KB (128 B line width)
L3 cache: 4 MB (128 B line width)
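Line widths fix how many 8-byte doubles arrive per cache miss; quick arithmetic on the figures above:

```python
DOUBLE = 8  # bytes per double-precision float

l1_line, l23_line = 32, 128  # line widths in bytes
print(l1_line // DOUBLE, "doubles per L1 line")     # 4
print(l23_line // DOUBLE, "doubles per L2/L3 line")  # 16

l1_size = 32 * 1024
print(l1_size // l1_line, "L1 lines")  # 1024
```

Each L1 miss therefore drags in 4 neighboring doubles, which is exactly why traversing arrays in storage order (previous slides) pays off.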

SLIDE 13


How to use this?

Next time!