SLIDE 1

RAMP for Exascale

RAMP Wrap, August 25th, 2010
Kathy Yelick, NERSC

SLIDE 2

NERSC Overview

NERSC represents science needs

  • Over 3,000 users, 400 projects, 500 code instances
  • Over 1,600 publications in 2009
  • Time is used by university researchers (65%), DOE labs (25%), and others


1 Petaflop Hopper system, late 2010

  • High application performance
  • Nodes: two 12-core AMD processors
  • Low-latency Gemini interconnect
SLIDE 3

Science at NERSC

  • Fusion: Simulations of fusion devices at ITER scale
  • Combustion: New algorithms (AMR) coupled to experiments
  • Energy storage: Catalysis for improved batteries and fuel cells
  • Capture & sequestration: EFRCs
  • Materials: For solar panels and other applications
  • Climate modeling: Work with users on scalability of cloud-resolving models
  • Nano devices: New single-molecule switching element

SLIDE 4

Algorithm Diversity

[Table: which numerical methods each science area uses]
  • Methods (columns): dense linear algebra, sparse linear algebra, spectral methods (FFTs), particle methods, structured grids, unstructured or AMR grids
  • Science areas (rows): accelerator science, astrophysics, chemistry, climate, combustion, fusion, lattice gauge, materials science

NERSC Qualitative In-Depth Analysis of Methods by Science Area

SLIDE 5

Numerical Methods at NERSC


[Chart: percentage of projects and percentage of allocated hours using each numerical method, 0% to 50%]

  • Caveat: survey data from ERCAP requests, based on PI input
  • Allocation is based on the hours allocated to projects that use the method
SLIDE 6

NERSC Interest in Exascale

[Chart: peak Teraflop/s (10 to 10^7, log scale) vs. year (2006 to 2020), showing the Top500 trend, the programming model of each era, and the NERSC system roadmap]

Programming models by era: COTS/MPP + MPI; COTS/MPP + MPI (+ OpenMP); GPU CUDA/OpenCL or manycore (BG/Q, R); Exascale + ???

NERSC roadmap: Franklin (N5) 19 TF sustained, 101 TF peak; Franklin (N5) + QC 36 TF sustained, 352 TF peak; Hopper (N6) >1 PF peak; NERSC-7 10 PF peak; NERSC-8 100 PF peak; NERSC-9 1 EF peak

Danger: dragging users into a local optimum for programming

SLIDE 7

Exascale is really about Energy Efficient Computing

At $1M per MW, energy costs are substantial

  • 1 petaflop in 2010 will use 3 MW
  • 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
  • 1 exaflop in 2018 at 20 MW is the DOE target (a rough cost comparison is sketched below)
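A quick back-of-the-envelope check of these power budgets (a minimal sketch; the roughly $1M per MW per year electricity cost is an assumption for illustration, reading the "$1M per MW" figure above as an annual rate):

```python
# Rough annual electricity cost for the scenarios above,
# assuming ~$1M per MW per year (illustrative rate, not stated on the slide).
COST_PER_MW_YEAR = 1.0e6  # dollars per megawatt per year

def annual_power_cost(megawatts):
    """Yearly electricity bill, in dollars, for a system drawing `megawatts`."""
    return megawatts * COST_PER_MW_YEAR

scenarios = [("1 PF in 2010", 3),
             ("1 EF in 2018, 'usual' scaling", 200),
             ("1 EF in 2018, DOE target", 20)]
for label, mw in scenarios:
    print(f"{label}: {mw} MW -> ${annual_power_cost(mw) / 1e6:.0f}M per year")
```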

[Chart: projected system power, 2005 to 2020, comparing "usual scaling" with the 20 MW goal]

SLIDE 8

The Challenge

  • Power is the leading design constraint in HPC system design
  • How do you build an exascale system without building a nuclear power plant next to your HPC center?
  • How do you ensure the system is balanced for a reasonable science workload?
  • How do you make it "programmable"?
SLIDE 9

Architecture Paths to Exascale

  • Leading Technology Paths (Swim Lanes)
    – Multicore: Maintain complex cores, and replicate (x86 and Power7)
    – Manycore/Embedded: Use many simpler, low-power cores from embedded (BlueGene)
    – GPU/Accelerator: Use highly specialized processors from the gaming space (NVIDIA Fermi, Cell)
  • Risks in Swim Lane selection
    – Select too soon: users cannot follow
    – Select too late: fall behind the performance curve
    – Select incorrectly: subject users to multiple disruptive changes
  • Users must be deeply engaged in this process
    – Cannot leave this up to vendors alone


SLIDE 10

Green Flash: Overview

John Shalf, PI

  • We present an alternative approach to developing systems to serve the needs of scientific computing
  • Choose our science target first to drive design decisions
  • Leverage new technologies driven by the consumer market
  • Auto-tune software for performance, productivity, and portability
  • Use hardware-accelerated architectural emulation to rapidly prototype designs (auto-tune the hardware too!)
  • Requires a holistic approach: must innovate algorithms, software, and hardware together (co-design)

Achieve a 100x energy-efficiency improvement over the mainstream HPC approach
SLIDE 11

System Balance

  • If you pay 5% more to double the FPUs and get a 10% performance improvement, it's a win (despite lowering your percentage of peak performance)
  • If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though percentage of peak looks better); both cases are worked through in the sketch below
  • Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
  • We have a fixed budget
    – Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
    – Balance involves balancing your checkbook and balancing your power budget
    – Requires application co-design to make the right trade-offs
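A minimal sketch of the arithmetic behind the first two bullets, comparing designs by performance per unit cost (the numbers are the illustrative ones from the slide, normalized to a baseline of 1.0):

```python
# Compare designs by performance per unit cost (or per watt); higher is better.
def perf_per_cost(perf, cost):
    return perf / cost

baseline = perf_per_cost(1.00, 1.00)

# Case 1: pay 5% more to double the FPUs, gain 10% performance -> a win.
double_fpus = perf_per_cost(1.10, 1.05)

# Case 2: pay 2x more for memory bandwidth, gain 35% performance -> a net loss.
more_bandwidth = perf_per_cost(1.35, 2.00)

print(f"baseline:     {baseline:.2f}")
print(f"double FPUs:  {double_fpus:.2f}  (better than baseline)")
print(f"2x memory BW: {more_bandwidth:.2f}  (worse than baseline)")
```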

SLIDE 12

The Complexity of Tradeoffs in Exascale System Design

[Diagram: the space of feasible systems lies at the intersection of the exascale performance envelope, the 20 MW power envelope, the $200M cost envelope, and the bytes/core envelope]

SLIDE 13

An Application Driver: Global Cloud Resolving Climate Model

SLIDE 14

Computational Requirements

  • ~2 million horizontal subdomains
  • 100 Terabytes of memory
    – 5 MB of memory per subdomain
  • ~20 million total subdomains
    – 20 PF sustained (200 PF peak)
    – Nearest-neighbor communication
  • New discretization for the climate model (CSU icosahedral code)

Must maintain 1000x faster than real time for practical climate simulation
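A quick consistency check of the memory numbers above (a sketch; it assumes the 5 MB figure applies to each of the ~20 million total subdomains, i.e. including the vertical levels):

```python
# Consistency check: total subdomains x memory per subdomain ~= 100 TB.
MB = 1e6   # bytes (decimal)
TB = 1e12  # bytes (decimal)

total_subdomains = 20e6        # ~20 million total subdomains
mem_per_subdomain = 5 * MB     # 5 MB per subdomain (assumed per total subdomain)

total_memory = total_subdomains * mem_per_subdomain
print(f"total memory: {total_memory / TB:.0f} TB")   # -> 100 TB
```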


SLIDE 15

An Application Driver: Seismic Exploration

SLIDE 16

Seismic Migration

  • Reconstruct the earth's subsurface
  • Focus on exploration, which requires a 10 km survey depth
  • Studies the velocity contrast between different materials below the surface

SLIDE 17

Seismic RTM Algorithm

  • Explicit finite-difference method used to approximate the wave equation
  • 8th-order stencil code
  • Typical survey size is 20 x 30 x 10 km
    – Runtime on current clusters is ~1 month
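A minimal sketch of the kind of kernel this describes: one explicit time step of a constant-density acoustic wave equation with an 8th-order spatial stencil. The coefficients are the standard central-difference values; the grid size, field names, and parameters are illustrative and not the production RTM code.

```python
import numpy as np

# Standard 8th-order central-difference coefficients for d^2/dx^2
# (offsets 0, +/-1, +/-2, +/-3, +/-4).
C = np.array([-205/72, 8/5, -1/5, 8/315, -1/560])

def step(p_prev, p_curr, vel, dt, dx):
    """Advance the wavefield one explicit time step; boundary cells (4 deep) are left untouched."""
    lap = np.zeros_like(p_curr)
    inner = (slice(4, -4),) * 3
    lap[inner] = 3 * C[0] * p_curr[inner]          # center term appears once per axis
    for axis in range(3):
        for k in range(1, 5):
            plus = np.roll(p_curr, -k, axis=axis)
            minus = np.roll(p_curr, k, axis=axis)
            lap[inner] += C[k] * (plus[inner] + minus[inner])
    lap /= dx * dx
    return 2 * p_curr - p_prev + (vel * dt) ** 2 * lap

# Tiny usage example on a 64^3 grid (a real 20 x 30 x 10 km survey grid is vastly larger).
n = 64
p0 = np.zeros((n, n, n))
p1 = np.zeros((n, n, n))
p1[n // 2, n // 2, n // 2] = 1.0            # point source
v = np.full((n, n, n), 3000.0)              # 3000 m/s everywhere
p2 = step(p0, p1, v, dt=1e-3, dx=10.0)
```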

SLIDE 18

Low Power Design Principles

  • Small is beautiful
    – Large array of simple, easy-to-verify cores
    – Embrace the embedded market
  • Slower clock frequencies give a roughly cubic power improvement (sketched below)
  • Emphasis on performance per watt
  • Reduce waste by not adding features that do not benefit science
  • Parallel, manycore processors are the path to energy efficiency
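Why the improvement is roughly cubic: dynamic power scales as C V^2 f, and the supply voltage can be lowered roughly in proportion to the clock frequency, so power falls approximately as f^3. A first-order sketch (idealized; real cores also have leakage and a minimum operating voltage):

```python
# First-order dynamic-power model: P ~ C * V^2 * f, with V scaled linearly with f.
def relative_power(freq_scale):
    """Power relative to the full-frequency design, assuming voltage tracks frequency."""
    voltage_scale = freq_scale
    return voltage_scale ** 2 * freq_scale      # ~ f^3

for f in (1.0, 0.5, 0.25):
    print(f"clock at {f:.2f}x -> power at {relative_power(f):.3f}x, "
          f"perf/watt gain {f / relative_power(f):.1f}x")
```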

SLIDE 19

Science Optimized Processor Design

  • Make it programmable
    – Hardware support for PGAS
      • Local store mapped into the global address space
      • Direct DMA between local stores to bypass the cache
    – Logical network topology looks like a full crossbar
  • Optimized for small transfers
  • On-chip interconnect optimized for the problem's communication pattern
  • Directly expose locality for optimized memory movement

SLIDE 20

Cost of Data Movement

[Chart: energy cost of data movement at increasing distances, including MPI communication]

SLIDE 21

Cost of Data Movement

[Chart: energy cost of data movement, including MPI, compared with the cost of a FLOP]

SLIDE 22

Not projected to improve much…

Energy efficiency will require careful management of data locality. It is important to know when you are on-chip and when data is off-chip!

SLIDE 23

Vertical Locality Management

  • Movement of data up and down the cache hierarchy
    – Caches virtualize the notion of on-chip vs. off-chip
    – Software-managed memory (local store) is hard to program (Cell)
  • Virtual local store
    – Use a conventional cache for portability
    – Use software-managed memory only for performance-critical code
    – Repartition as needed

SLIDE 24

Horizontal Locality Management

  • Movement of data between processors
    – 10x lower latency and 10x higher bandwidth on-chip
    – Need to minimize the distance of horizontal data movement
  • Encode horizontal locality into the memory address (see the sketch below)
    – Hardware hierarchy where high-order bits encode the cabinet and low-order bits encode chip-level distance
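A toy sketch of what such an address encoding could look like. The field names and bit widths here are invented for illustration; the slide does not specify an actual layout.

```python
# Hypothetical hierarchical global-address layout (widths chosen arbitrarily):
#   [ cabinet : 8 bits | board : 8 bits | chip : 8 bits | local offset : 40 bits ]
OFFSET_BITS, CHIP_BITS, BOARD_BITS = 40, 8, 8

def encode(cabinet, board, chip, offset):
    """Pack the locality hierarchy into the high-order bits of a global address."""
    return (((cabinet << BOARD_BITS | board) << CHIP_BITS | chip) << OFFSET_BITS) | offset

def decode(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    chip = (addr >> OFFSET_BITS) & ((1 << CHIP_BITS) - 1)
    board = (addr >> (OFFSET_BITS + CHIP_BITS)) & ((1 << BOARD_BITS) - 1)
    cabinet = addr >> (OFFSET_BITS + CHIP_BITS + BOARD_BITS)
    return cabinet, board, chip, offset

# Comparing the high-order bits tells hardware (or the runtime) how far away data lives.
a = encode(cabinet=3, board=1, chip=7, offset=0x1000)
b = encode(cabinet=3, board=1, chip=9, offset=0x2000)
same_cabinet = decode(a)[0] == decode(b)[0]
print(decode(a), decode(b), "same cabinet:", same_cabinet)
```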

SLIDE 25

Application Analysis - Climate

  • Analyze each loop within the climate code
    – Extract temporal reuse and bandwidth requirements
  • Use traces to determine cache size and DRAM bandwidth requirements
  • Ensure the memory hierarchy can support the application
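A minimal sketch of the kind of trace analysis this implies: estimating how much cache a loop needs from a stream of memory addresses. This is a generic working-set estimator, not the Green Flash toolchain.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line (assumed)

def working_set_bytes(address_trace, window):
    """Max bytes touched within any sliding window of `window` accesses.

    Approximates the cache size needed to capture temporal reuse at that scale.
    """
    recent = OrderedDict()      # cache line -> index of most recent access
    max_lines = 0
    for i, addr in enumerate(address_trace):
        line = addr // LINE
        recent[line] = i
        recent.move_to_end(line)
        # Evict lines whose last use has fallen outside the window.
        while recent and next(iter(recent.values())) <= i - window:
            recent.popitem(last=False)
        max_lines = max(max_lines, len(recent))
    return max_lines * LINE

# Usage: a toy trace that streams over an 8 KB array twice.
trace = [a for _ in range(2) for a in range(0, 8192, 8)]
print(working_set_bytes(trace, window=2048), "bytes")
```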

SLIDE 26

Application Optimization - Climate

  • Original code:
    – 160 KB cache requirement
    – < 50% FP instructions
  • Tuned code:
    – 1 KB cache requirement
    – > 85% FP instructions

Loop optimization resulted in a 160x reduction in cache size and a 2x increase in execution speed
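A generic illustration of the kind of loop restructuring that shrinks a working set this way: fusing passes so values are produced and consumed while still cache-resident, instead of round-tripping through a large temporary array (not the actual climate kernel):

```python
import numpy as np

a = np.random.rand(200_000)

def unfused(a):
    # Pass 1 writes an n-element temporary; pass 2 re-reads it.
    # The working set is O(n), far larger than a small cache.
    t = a * 2.0 + 1.0
    return float(np.sum(t * t))

def fused(a, block=64):
    # Produce and consume each small block immediately, so only a few
    # hundred bytes need to stay cache-resident at any time.
    total = 0.0
    for i in range(0, len(a), block):
        t = a[i:i + block] * 2.0 + 1.0
        total += float(np.sum(t * t))
    return total

assert np.isclose(unfused(a), fused(a))
```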

SLIDE 27

Co-design Advantage - Climate

SLIDE 28

Co-design Advantage - Climate

SLIDE 29

Co-design Advantage - Seismic

[Chart: co-design performance advantage for seismic RTM, in MPoints/sec (0 to 4500), comparing 8-core Nehalem, Fermi, and Green Wave]

SLIDE 30

Co-design Advantage - Seismic

[Chart: co-design power-efficiency advantage for seismic RTM, in MPoints/Watt (20 to 160), comparing 8-core Nehalem, Fermi, and Tensilica Green Wave]

SLIDE 31

Extending to General Stencil Codes

  • Generalized co-tuning framework being developed
  • Co-tuning framework applied to multiple architectures
    – Manycore and GPU support
  • Significant advantage over tuning only HW or SW
    – ~3x power and area advantage gained


SLIDE 32

RAMP Infrastructure

  • FPGA emulation is critical for HW/SW co-design
    – Enables full application benchmarking
    – Serves as an autotuning target
    – Provides a feedback path to the HW architect from application developers
  • RAMP gateware and methodology needed to glue processors together

SLIDE 33

Green Flash Impact

  • Clear demonstration of improved performance per watt on multiple scientific codes
    – X improvement over current architectures
    – Future HPC systems are power limited
    – The Green Flash methodology provides a path to exascale
  • DOE has responded by establishing exascale co-design centers around the nation
    – Green Flash and RAMP have played a key role in effecting this shift
    – DOE funding for ISIS, CoDEX, and application co-design centers

SLIDE 34

Co-Design is Critical for Exascale

  • Major changes ahead for computing, including HPC
    – Redesign algorithms to minimize communication (communication-avoiding/optimal)
    – Develop programming models that allow for communication control
    – Feed back into architecture designs
      • How much data parallelism, how much local store, etc.
    – Develop science applications
    – Reduce risk in the Exascale program
  • IP issues are critical to the success of co-design

SLIDE 35

Co-Design Before its Time

  • Green Flash demoed during SC '09
  • CSU atmospheric model ported to the low-power core design
    – Dual-core Tensilica processors running the atmospheric model at 25 MHz
    – MPI routines ported to the custom Tensilica interconnect
  • Memory and processor stats available for performance analysis
  • Emulation performance advantage
    – 250x speedup over a purely functional software simulator
  • Actual code running, not a representative benchmark

Icosahedral mesh for algorithm scaling

SLIDE 36

More Info

  • Green Flash
    – http://www.lbl.gov/CS/html/greenflash.html
    – http://www.lbl.gov/CS/html/greenmeetings.html
  • NERSC Advanced Technology Group
    – http://www.nersc.gov/projects/SDSA