ACM Learning Webinar: "Current Trends in High Performance Computing and Challenges for the Future" with Jack Dongarra

SLIDE 1

"Housekeeping"

Twitter: #ACMLearning

  • Welcome to today's ACM Learning Webinar, "Current Trends in High Performance Computing and Challenges for the Future" with Jack Dongarra. The presentation starts at the top of the hour and lasts 60 minutes. Slides will advance automatically throughout the event. You can resize the slide area and other windows by dragging the bottom right corner of the slide window, and you can move them around the screen. On the bottom panel you'll find a number of widgets, including Twitter, Sharing, and Wikipedia apps.
  • If you are experiencing any problems with audio or video, refresh your console by pressing the F5 key on your keyboard in Windows, Command + R on a Mac, or refresh your browser if you're on a mobile device; or close and re-launch the presentation. You can also view the Webcast Help Guide by clicking on the "Help" widget in the bottom dock.
  • To control volume, adjust the master volume on your computer. If the volume is still too low, use headphones.
  • If you think of a question during the presentation, please type it into the Q&A box and click on the submit button. You do not need to wait until the end of the presentation to begin submitting questions.
  • At the end of the presentation, you'll see a survey open in your browser. Please take a minute to fill it out to help us improve your next webinar experience.
  • You can download a copy of these slides by clicking on the Resources widget in the bottom dock.
  • This session is being recorded and will be archived for on-demand viewing in the next 1-2 days. You will receive an automatic email notification when it is available. Check http://learning.acm.org/ in a few days for updates, and see http://learning.acm.org/webinar for archived recordings of past webcasts.

SLIDE 2

ACM Highlights

  • Learning Center tools for professional development: http://learning.acm.org
    • 4,900+ trusted technical books and videos from O'Reilly, Morgan Kaufmann, etc.
    • 1,400+ courses, virtual labs, test preps, and live mentoring for software professionals, covering programming, data management, cybersecurity, networking, project management, and more
    • 30,000+ task-based short videos for "just-in-time" learning
    • Training toward top vendor certifications (CEH, Cisco, CISSP, CompTIA, ITIL, PMI, etc.)
  • Learning Webinars from thought leaders and top practitioners (http://webinar.acm.org)
  • Podcast interviews with innovators, entrepreneurs, and award winners
  • Popular publications:
    • Flagship Communications of the ACM (CACM) magazine: http://cacm.acm.org/
    • ACM Queue magazine for practitioners: http://queue.acm.org/
  • ACM Digital Library, the world's most comprehensive database of computing literature: http://dl.acm.org
  • International conferences that draw leading experts on a broad spectrum of computing topics: http://www.acm.org/conferences
  • Prestigious awards, including the ACM A.M. Turing Award and the ACM Prize in Computing: http://awards.acm.org
  • And much more… http://www.acm.org

SLIDE 3

"Housekeeping" (same content as Slide 1)

SLIDE 4

Talk Back

  • Use the Twitter widget to Tweet your favorite quotes from today's presentation with hashtag #ACMLearning
  • Submit questions and comments via Twitter to @acmeducation – we're reading them!
  • Use the sharing widget in the bottom panel to share this presentation with friends and colleagues.

SLIDE 5

Current Trends in High Performance Computing and Challenges for the Future

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester

2/7/2017

SLIDE 6

Outline

  • Overview of High Performance Computing
  • Directions for the Future

SLIDE 7

Simulation: The Third Pillar of Science

  • Traditional scientific and engineering paradigms:
    1) Do theory or paper design.
    2) Perform experiments or build physical system.
  • Limitations:
    • Too difficult -- build large wind tunnels.
    • Too expensive -- build a throw-away passenger jet.
    • Too slow -- wait for climate or galactic evolution.
    • Too dangerous -- weapons, drug design, climate experimentation.
  • Computational science paradigm:
    3) Use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.
SLIDE 8

The Range of Applications that Depend on HPC is Incredibly Broad and Diverse

  • Airplane wing design,
  • Quantum chemistry,
  • Geophysical flows,
  • Noise reduction,
  • Diffusion of solid bodies in a liquid,
  • Computational materials research,
  • Weather forecasting,
  • Deep learning in neural networks,
  • Stochastic simulation,
  • Massively parallel data mining,

SLIDE 9

State of Supercomputing in 2017

  • Pflops (> 10^15 Flop/s) computing fully established, with 117 computer systems.
  • Three technology architectures, or "swim lanes," are thriving:
    • Commodity (e.g. Intel)
    • Commodity + accelerator (e.g. GPUs) (88 systems)
    • Lightweight cores (e.g. IBM BG, ARM, Intel's Knights Landing)
  • Interest in supercomputing is now worldwide, and growing in many new markets (~50% of Top500 computers are in industry).
  • Exascale (10^18 Flop/s) projects exist in many countries and regions.
  • Intel processors have the largest share, 92%, followed by AMD at 1%.

SLIDE 10

  • H. Meuer, H. Simon, E. Strohmaier, & JD
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem; TPP performance). A minimal illustration of what this measures follows the list.
  • Updated twice a year:
    • SC'xy in the States in November
    • Meeting in Germany in June
  • All data available from www.top500.org
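The Rmax yardstick comes from timing a dense solve of Ax = b. As a rough illustration only (a minimal Python sketch, not the actual HPL code; it uses the conventional 2/3·n^3 + 2·n^2 flop count for an LU-based solve, and the problem size is arbitrary):

```python
# Minimal sketch of a LINPACK-style measurement (illustrative only, not HPL).
# Solve a dense Ax = b and report Gflop/s using the standard
# 2/3*n^3 + 2*n^2 operation count for an LU-based solve.
import time
import numpy as np

n = 4096                                  # problem size (HPL runs use far larger n)
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
t1 = time.perf_counter()

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
print(f"n = {n}, time = {t1 - t0:.3f} s, rate = {flops / (t1 - t0) / 1e9:.1f} Gflop/s")
print("scaled residual:", np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x)))
```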

SLIDE 11

Performance Development of HPC over the Last 24 Years from the Top500

[Chart: Top500 performance, 1994-2016, log scale from 100 Mflop/s to 1 Eflop/s, showing SUM, N=1, and N=500. 1994: SUM = 1.17 TFlop/s, N=1 = 59.7 GFlop/s, N=500 = 400 MFlop/s. 2016: SUM = 672 PFlop/s, N=1 = 93 PFlop/s, N=500 = 349 TFlop/s.]

My Laptop: 70 Gflop/s
My iPhone & iPad: 4 Gflop/s

SLIDE 12

PERFORMANCE DEVELOPMENT

[Chart: Top500 performance, 1994 through a projection to 2020, log scale from 1 Gflop/s to 1 Eflop/s, showing SUM, N=1, N=10, and N=100.]

  • Tflops (10^12) achieved: ASCI Red, Sandia NL
  • Pflops (10^15) achieved: RoadRunner, Los Alamos NL
  • Eflops (10^18) achieved? China says 2020, U.S. says 2021

SLIDE 13

November 2016: The TOP 10 Systems

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | GFlops/Watt
1 | National Super Computer Center in Wuxi | Sunway TaihuLight, SW26010 (260C) + Custom | China | 10,649,600 | 93.0 | 74 | 15.4 | 6.04
2 | National Super Computer Center in Guangzhou | Tianhe-2, NUDT Xeon (12C) + Intel Xeon Phi (57C) + Custom | China | 3,120,000 | 33.9 | 62 | 17.8 | 1.91
3 | DOE / OS Oak Ridge Nat Lab | Titan, Cray XK7, AMD (16C) + Nvidia Kepler GPU (14C) + Custom | USA | 560,640 | 17.6 | 65 | 8.21 | 2.14
4 | DOE / NNSA Livermore Nat Lab | Sequoia, BlueGene/Q (16C) + Custom | USA | 1,572,864 | 17.2 | 85 | 7.89 | 2.18
5 | DOE / OS Berkeley Nat Lab | Cori, Cray XC40, Xeon Phi (68C) + Custom | USA | 622,336 | 14.0 | 50 | 3.94 | 3.55
6 | Joint Center for Advanced HPC | Oakforest-PACS, Fujitsu Primergy CX1640, Xeon Phi (68C) + Omni-Path | Japan | 558,144 | 13.6 | 54 | 2.72 | 4.98
7 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8C) + Custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 0.827
8 | Swiss CSCS | Piz Daint, Cray XC50, Xeon (12C) + Nvidia P100 (56C) + Custom | Switzerland | 206,720 | 9.78 | 61 | 1.31 | 7.45
9 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16C) + Custom | USA | 786,432 | 8.59 | 85 | 3.95 | 2.07
10 | DOE / NNSA / Los Alamos & Sandia | Trinity, Cray XC40, Xeon (16C) + Custom | USA | 301,056 | 8.10 | 80 | 4.23 | 1.92
500 | Internet company | Inspur, Intel (8C) + Nvidia | China | 5,440 | 0.286 | 71 | |

TaihuLight is 5.2 X the performance of Titan.
TaihuLight is 1.1 X the sum of all DOE systems.

SLIDE 14

Recent Developments

  • US DOE planning to deploy O(100) Pflop/s systems for 2017-2018 ($525M hardware)
    • Oak Ridge Lab and Lawrence Livermore Lab to receive IBM and Nvidia based systems
    • Argonne Lab to receive Intel based system
    • After this, Exascale systems
  • US Dept of Commerce is preventing some China groups from receiving Intel technology
    • Citing concerns about nuclear research being done with the systems; February 2015
    • On the blockade list:
      • National SC Center Guangzhou, site of Tianhe-2
      • National SC Center Tianjin, site of Tianhe-1A
      • National University for Defense Technology, developer
      • National SC Center Changsha, location of NUDT

SLIDE 15

Toward Exascale

  • China plans for Exascale: 2020
    • Three separate developments in HPC; "Anything but from the US"
    • Wuxi: follow-on to TaihuLight, O(100) Pflops, all Chinese
    • National University for Defense Technology: upgrade Tianhe-2A, O(100) Pflops, will be Chinese ARM processor + accelerator
    • Sugon - CAS ICT: x86 based, Chinese made; collaboration with AMD
  • US Dept of Energy: Exascale Computing Program (ECP)
    • 7-year program
    • Initial exascale system based on advanced architecture, delivered in 2021
    • Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023

SLIDE 16

China's First Homegrown Many-core Processor

  • ShenWei SW26010 Processor
  • Vendor: Shanghai High Performance IC Design Center
  • Supported by the National Science and Technology Major Project (NMP): Core Electronic Devices, High-end Generic Chips, and Basic Software
  • 28 nm technology
  • 260 cores
  • 3 Tflop/s peak

SLIDE 17

Sunway TaihuLight http://bit.ly/sunway-2016

  • SW26010 processor
    • Chinese design, fab, and ISA
    • 1.45 GHz
  • Node = 260 cores (1 socket)
    • 4 core groups
    • 64 CPEs per group, no cache, 64 KB scratchpad per CPE
    • 1 MPE per group with 32 KB L1 dcache & 256 KB L2 cache
    • 32 GB memory total, 136.5 GB/s
    • ~3 Tflop/s (22 flops/byte)
  • Cabinet = 1,024 nodes
    • 4 supernodes, each with 32 boards (4 cards per board, 2 nodes per card)
    • ~3.14 Pflop/s
  • 40 cabinets in the system
    • 40,960 nodes total
    • 125 Pflop/s total peak (see the arithmetic sketch after this list)
    • 10,649,600 cores total
    • 1.31 PB of primary memory (DDR3)
  • 93 Pflop/s for the HPL Benchmark, 74% of peak
  • 15.3 MWatts, water cooled
  • 6.07 Gflop/s per Watt
  • 1.8B RMB ~ $280M (building, hw, apps, sw, …)
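A quick sanity check of the peak and core-count figures above (a sketch; the per-core rates of 8 DP flops/cycle per CPE and 16 DP flops/cycle per MPE are assumptions commonly cited for the SW26010, not stated on the slide):

```python
# Back-of-the-envelope peak for TaihuLight. The per-core flop rates below
# (8 DP flops/cycle per CPE, 16 per MPE) are assumed, not from the slide.
nodes = 40_960
core_groups_per_node = 4
cpes_per_group, mpes_per_group = 64, 1
ghz = 1.45

flops_per_node_per_cycle = core_groups_per_node * (cpes_per_group * 8 + mpes_per_group * 16)
peak_pflops = nodes * flops_per_node_per_cycle * ghz * 1e9 / 1e15
cores = nodes * core_groups_per_node * (cpes_per_group + mpes_per_group)

print(f"peak  ~ {peak_pflops:.1f} Pflop/s")   # ~125.4 Pflop/s, matching the slide
print(f"cores = {cores:,}")                   # 10,649,600, matching the slide
```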
SLIDE 18

Gordon Bell Award

  • Since 1987 the ACM's Gordon Bell Prize has been awarded at the ACM/IEEE Supercomputing Conference (SC) to recognize outstanding achievement in high-performance computing.
  • The purpose of the award is to track the progress of parallel computing, with emphasis on rewarding innovation in applying HPC to applications.
  • Financial support for the $10,000 award is provided by Gordon Bell, a pioneer in high-performance and parallel computing.
  • Authors mark their SC paper as a possible Gordon Bell Prize competitor.
  • The Gordon Bell Committee reviews the papers and selects 6 papers as finalists for the competition.
  • Presentations are made at SC and a winner is chosen.
SLIDE 19

Gordon Bell Award: 6 Finalists at SC16 in November

  • "Modeling Dilute Solutions Using First-Principles Molecular Dynamics: Computing More than a Million Atoms with Over a Million Cores,"
    • Lawrence Livermore National Laboratory (Calif.)
  • "Towards Green Aviation with Python at Petascale,"
    • Imperial College London (England)
  • "Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes,"
    • RIKEN (Japan), Chiba University (Japan), Kobe University (Japan), and Fujitsu Ltd. (Japan)
  • "Extreme-Scale Phase Field Simulations of Coarsening Dynamics on the Sunway TaihuLight Supercomputer,"
    • Chinese Academy of Sciences, the University of South Carolina, Columbia University (New York), the National Research Center of Parallel Computer Engineering and Technology (China), and the National Supercomputing Center in Wuxi (China)
  • "A Highly Effective Global Surface Wave Numerical Simulation with Ultra-High Resolution,"
    • First Institute of Oceanography (China), National Research Center of Parallel Computer Engineering and Technology (China), and Tsinghua University (China)
  • "10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics,"
    • Chinese Academy of Sciences, Tsinghua University (China), the National Research Center of Parallel Computer Engineering and Technology (China), and Beijing Normal University (China)

SLIDE 20

VENDORS / SYSTEM SHARE (# of systems, % of 500)

  • HPE: 112, 22%
  • Lenovo: 96, 19%
  • Cray Inc.: 56, 11%
  • Sugon: 47, 9%
  • Others: 43, 9%
  • IBM: 36, 7%
  • SGI: 28, 6%
  • Bull, Atos: 20, 4%
  • Inspur: 18, 4%
  • Huawei: 16, 3%
  • Dell: 13, 3%
  • Fujitsu: 11, 2%
  • NUDT: 4, 1%

SLIDE 21

VENDORS / SYSTEM SHARE (# of systems, % of 500; same chart as Slide 20)

36% of the vendors are from China

SLIDE 22

Countries' Share of the Top500

China has 1/3 of the systems, while the number of systems in the US has fallen to the lowest point since the TOP500 list was created.

Number of systems on the Top500:
  • US: 171
  • China: 171
  • Germany: 32
  • Japan: 27
  • France: 20
  • UK: 17
  • Poland: 7
  • Italy: 6
  • Saudi Arabia: 5
  • Russia: 5
  • India: 5
  • Others: 34

Each rectangle in the accompanying chart represents one of the Top500 computers; the area of the rectangle reflects its performance.

SLIDE 23

Confessions of an Accidental Benchmarker

  • Appendix B of the Linpack Users' Guide was designed to help users extrapolate execution time for the Linpack software package.
  • First benchmark report from 1977; it began in the late 70s, a time when floating point operations were expensive compared to other operations and data movement.

SLIDE 24

http://tiny.cc/hpcg

Many Other Benchmarks

  • TOP500
  • Green 500
  • Graph 500
  • Sustained Petascale Performance
  • HPC Challenge
  • Perfect
  • ParkBench
  • SPEC-hpc
  • Big Data Top100
  • Livermore Loops
  • EuroBen
  • NAS Parallel Benchmarks
  • Genesis
  • RAPS
  • SHOC
  • LAMMPS
  • Dhrystone
  • Whetstone
  • I/O Benchmarks
  • WRF
  • Yellowstone
  • Roofline
  • Neptune

SLIDE 25

LINPACK Benchmark: High Performance Linpack (HPL)

  • HPL is a widely recognized and discussed metric for ranking high performance computing systems.
  • When HPL gained prominence as a performance metric in the early 1990s, there was a strong correlation between its predictions of system rankings and the rankings that full-scale applications would realize.
  • Computer system vendors pursued designs that would increase their HPL performance, which would in turn improve overall application performance.
  • Today HPL remains valuable as a measure of historical trends, and as a stress test, especially for leadership-class systems that are pushing the boundaries of current technology.

SLIDE 26

The Problem

  • HPL performance of computer systems is no longer so strongly correlated to real application performance, especially for the broad set of HPC applications governed by partial differential equations.
  • Designing a system for good HPL performance can actually lead to design choices that are wrong for the real application mix, or add unnecessary components or complexity to the system.

SLIDE 27

Peak Performance - Per Core

Floating point operations per cycle per core (a worked example of per-core peak follows this list):

 Most recent computers have FMA (fused multiply-add), i.e. x ← x + y*z in one cycle
 Intel Xeon earlier models and AMD Opteron have SSE2
   2 flops/cycle DP & 4 flops/cycle SP
 Intel Xeon Nehalem ('09) & Westmere ('10) have SSE4
   4 flops/cycle DP & 8 flops/cycle SP
 Intel Xeon Sandy Bridge ('11) & Ivy Bridge ('12) have AVX
   8 flops/cycle DP & 16 flops/cycle SP
 Intel Xeon Haswell ('13) & Broadwell ('14) have AVX2
   16 flops/cycle DP & 32 flops/cycle SP
   Xeon Phi (per core) is at 16 flops/cycle DP & 32 flops/cycle SP
 Intel Xeon Skylake (server) and Knights Landing have AVX-512  <-- We are here
   32 flops/cycle DP & 64 flops/cycle SP
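Per-core peak is simply flops/cycle times clock rate. A small sketch (the flops/cycle values follow the list above; the clock rates are illustrative assumptions, not from the slide):

```python
# Per-core peak = (flops per cycle) x (clock rate).
# Flops/cycle values follow the slide; clock rates are illustrative only.
examples = [
    ("SSE2 core @ 2.4 GHz",      2, 2.4),
    ("AVX core @ 2.6 GHz",       8, 2.6),
    ("AVX2/FMA core @ 2.6 GHz", 16, 2.6),
    ("AVX-512 core @ 2.1 GHz",  32, 2.1),
]
for name, flops_per_cycle, ghz in examples:
    print(f"{name:26s} peak = {flops_per_cycle * ghz:6.1f} Gflop/s DP per core")
```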

SLIDE 28

CPU Access Latencies in Clock Cycles

[Chart: CPU access latencies, in clock cycles, across the memory hierarchy.]

In 167 cycles, a core sustaining 16 DP flops/cycle can do 167 x 16 = 2,672 DP flops.

Today floating point operations are inexpensive; data movement is very expensive.

SLIDE 29

Many Problems in Computational Science Involve Solving PDEs: Large Sparse Linear Systems

Given a PDE over some domain (e.g., modeling diffusion or fluid flow):

    P u = f over the domain, plus boundary conditions
    (where P denotes the differential operator)

Discretization (e.g., Galerkin equations): find u_h = Σ_i x_i Φ_i such that

    (P u_h, Φ_j) = (f, Φ_j) for all Φ_j
    ⇒  Σ_i (P Φ_i, Φ_j) x_i = (f, Φ_j)
    ⇒  a sparse linear system A x = b, with a_ji = (P Φ_i, Φ_j) and b_j = (f, Φ_j)

The basis functions Φ_j typically have local support, so only neighboring basis functions interact and the resulting matrix is sparse. For example, if node 10 has neighbors 35, 100, 115, 201, and 332, then row 10 has only 6 nonzeros: a_10,10, a_10,332, a_10,100, a_10,115, a_10,201, a_10,35.
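To make the "local support leads to a sparse matrix" point concrete, a small hypothetical example: the standard 3-point finite-difference discretization of -u'' = f on a 1D grid, assembled with SciPy (illustrative only; HPCG, discussed on the next slide, uses a 27-point 3D stencil):

```python
# Illustrative: a 1D Poisson problem -u'' = f discretized with the standard
# 3-point stencil. Each grid point couples only to its neighbors, so every
# row of the matrix has at most 3 nonzeros.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n = 8                                     # interior grid points
h = 1.0 / (n + 1)                         # grid spacing
main = (2.0 / h**2) * np.ones(n)
off = (-1.0 / h**2) * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")

b = np.ones(n)                            # right-hand side samples
x = spsolve(A, b)                         # solve the sparse system A x = b

print("nonzeros per row:", np.diff(A.indptr))
print("A has", A.nnz, "nonzeros out of", n * n, "entries")
```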

SLIDE 30

HPCG

  • High Performance Conjugate Gradients (HPCG); a minimal CG sketch follows this list.
  • Solves Ax = b, with A large and sparse, b known, x computed.
  • An optimized implementation of PCG contains essential computational and communication patterns that are prevalent in a variety of methods for the discretization and numerical solution of PDEs.
  • Synthetic discretized 3D PDE (FEM, FVM, FDM).
  • Sparse matrix:
    • 27 nonzeros per row in the interior.
    • 8 - 18 on the boundary.
    • Symmetric positive definite.
  • Patterns:
    • Dense and sparse computations.
    • Dense and sparse collectives.
    • Multi-scale execution of kernels via MG (truncated) V cycle.
    • Data-driven parallelism (unstructured sparse triangular solves).
  • Strong verification (via spectral properties of PCG).

hpcg-benchmark.org
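For readers unfamiliar with the kernel HPCG exercises, a minimal unpreconditioned conjugate gradient iteration is sketched below; HPCG itself runs a preconditioned CG with a multigrid V-cycle, which this sketch omits:

```python
# Minimal unpreconditioned conjugate gradient for a symmetric positive
# definite sparse A. HPCG uses *preconditioned* CG (multigrid V-cycle);
# this sketch only shows the core SpMV / dot-product / vector-update pattern.
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                 # sparse matrix-vector product (SpMV)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test problem: the 1D Laplacian (3-point stencil).
n = 100
A = sp.diags([-np.ones(n - 1), 2 * np.ones(n), -np.ones(n - 1)], [-1, 0, 1], format="csr")
b = np.ones(n)
x = cg(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))
```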

SLIDE 31

HPCG Results, Nov 2016, 1-10

# | Site | Computer | Cores | HPL Pflops | HPCG Pflops | % of Peak
1 | RIKEN Advanced Institute for Computational Science | K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | 705,024 | 10.5 | 0.603 | 5.3%
2 | NSCC / Guangzhou | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi 57C + Custom | 3,120,000 | 33.8 | 0.580 | 1.1%
3 | Joint Center for Advanced HPC, Japan | Oakforest-PACS, PRIMERGY CX600 M1, Intel Xeon Phi | 557,056 | 24.9 | 0.385 | 2.8%
4 | National Supercomputing Center in Wuxi, China | Sunway TaihuLight, Sunway MPP, SW26010 | 10,649,600 | 93.0 | 0.371 | 0.3%
5 | DOE/SC/LBNL/NERSC, USA | Cori, XC40, Intel Xeon Phi, Cray | 632,400 | 13.8 | 0.355 | 1.3%
6 | DOE/NNSA/LLNL, USA | Sequoia, IBM BlueGene/Q, IBM | 1,572,864 | 17.1 | 0.330 | 1.6%
7 | DOE/SC/Oak Ridge Nat Lab | Titan, Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | 560,640 | 17.5 | 0.322 | 1.2%
8 | DOE/NNSA/LANL/SNL | Trinity, Cray XC40, Intel E5-2698v3, Aries custom | 301,056 | 8.10 | 0.182 | 1.6%
9 | NASA / Mountain View | Pleiades, SGI ICE X, Intel E5-2680, E5-2680V2, E5-2680V3, Infiniband FDR | 243,008 | 5.90 | 0.175 | 2.5%
10 | DOE/SC/Argonne National Laboratory | Mira, BlueGene/Q, Power BQC 16C 1.60GHz, Custom | 786,432 | 8.58 | 0.167 | 1.7%

SLIDE 32

Comparison Peak, HPL, & HPCG

SLIDE 33

Comparison Peak, HPL, & HPCG

SLIDE 34

Classical Analysis of Algorithms May Not Be Valid

  • Processors are over-provisioned for floating point arithmetic.
  • Data movement is extremely expensive.
  • Operation count is not a good indicator of the time to solve a problem.
  • Algorithms that do more ops may actually take less time.

SLIDE 35

Level 1, 2 and 3 BLAS

68-core Intel Xeon Phi KNL, 1.3 GHz; theoretical peak double precision is 2662 Gflop/s. Compiled with icc and using Intel MKL 2017b1 20160506.

[Chart: measured performance of Level 1, 2, and 3 BLAS kernels (35.1 Gflop/s, 80.3 Gflop/s, and 2100 Gflop/s); the "35x" annotation marks the gap between the Level 3 kernel and the lower-level kernels.]
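A rough way to reproduce the flavor of this comparison on any machine (a sketch using NumPy, which calls whatever BLAS it is linked against; the sizes, and the resulting numbers, will differ from the MKL/KNL results on the slide):

```python
# Illustrative timing of Level 1 (axpy), Level 2 (gemv), and Level 3 (gemm)
# BLAS operations via NumPy. Absolute rates depend on the machine and the
# BLAS library; the point is the large gap between the levels.
import time
import numpy as np

n = 4000
A = np.random.rand(n, n)
B = np.random.rand(n, n)
x = np.random.rand(n)
y = np.random.rand(n)

def rate(flops, seconds):
    return flops / seconds / 1e9     # Gflop/s

t = time.perf_counter(); z = 2.0 * x + y; t1 = time.perf_counter() - t   # Level 1: ~2n flops
t = time.perf_counter(); v = A @ x;       t2 = time.perf_counter() - t   # Level 2: ~2n^2 flops
t = time.perf_counter(); C = A @ B;       t3 = time.perf_counter() - t   # Level 3: ~2n^3 flops

print(f"Level 1 (axpy): {rate(2 * n, t1):9.2f} Gflop/s")
print(f"Level 2 (gemv): {rate(2 * n**2, t2):9.2f} Gflop/s")
print(f"Level 3 (gemm): {rate(2 * n**3, t3):9.2f} Gflop/s")
```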

SLIDE 36

Singular Value Decomposition, LAPACK Version 1991

[Chart: SVD performance on a dual-socket Intel Sandy Bridge, 2.6 GHz, 8 cores per socket (8 flops per core per cycle), comparing 3 generations of software: EISPACK QR (1975), LINPACK QR (1979), LAPACK QR using 1 core (1991), and LAPACK QR with the BLAS in parallel on 16 cores. The first stage uses Level 1, 2, & 3 BLAS and costs 8/3 n^3 ops; QR refers to the QR algorithm for computing the eigenvalues.]

SLIDE 37

Bottleneck in the Bidiagonalization
The Standard Bidiagonal Reduction: xGEBRD

Two steps: factor the panel, then update the trailing matrix (factor panel k, update, factor panel k+1, ...). The reduction computes Q*A*P^H, and each panel factorization requires 2 GEMVs.

Characteristics:
  • Total cost 8n^3/3 (reduction to bi-diagonal)
  • Too many Level 2 BLAS operations
  • 4/3 n^3 from GEMV and 4/3 n^3 from GEMM
  • Performance limited to 2x the performance of GEMV
  • Memory bound algorithm

SLIDE 38

Recent Work on a 2-Stage Algorithm

First stage: reduce to band form. Second stage: bulge chasing, band to bi-diagonal.

Characteristics:
  • Stage 1:
    • Fully Level 3 BLAS
    • Dataflow, asynchronous execution
  • Stage 2:
    • Level "BLAS-1.5"
    • Asynchronous execution
    • Cache friendly kernel (reduced communication)

SLIDE 39

Recent work on developing the new 2-stage algorithm

First stage: to band. Second stage: bulge chasing, to bi-diagonal.

More flops than the original 8/3 n^3: about 25% more.

SLIDE 40

Recent work on developing the new 2-stage algorithm

First stage: to band. Second stage: bulge chasing, to bi-diagonal.

25% more flops, and 1.8 - 6 times faster (measured on 16 Sandy Bridge cores at 2.6 GHz).

SLIDE 41

Critical Issues at Peta & Exascale for Algorithm and Software Design

  • Synchronization-reducing algorithms
    • Break the fork-join model
  • Communication-reducing algorithms
    • Use methods which have a lower bound on communication
  • Mixed precision methods (see the sketch after this list)
    • 2x speed of ops and 2x speed for data movement
  • Autotuning
    • Today's machines are too complicated; build "smarts" into software to adapt to the hardware
  • Fault resilient algorithms
    • Implement algorithms that can recover from failures/bit flips
  • Reproducibility of results
    • Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
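One common mixed-precision pattern (an assumed example here; the slide does not name a specific method) is iterative refinement: factor and solve in single precision, then refine with double-precision residuals. A minimal sketch:

```python
# Mixed-precision iterative refinement (illustrative sketch, not a library API).
# The expensive O(n^3) factorization is done in float32; residuals and the
# solution are accumulated in float64 to recover double-precision accuracy.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # reasonably well-conditioned test matrix
b = rng.standard_normal(n)

lu32 = lu_factor(A.astype(np.float32))            # factor once, in single precision
x = lu_solve(lu32, b.astype(np.float32)).astype(np.float64)

for it in range(5):
    r = b - A @ x                                 # residual in double precision
    d = lu_solve(lu32, r.astype(np.float32)).astype(np.float64)
    x += d                                        # correct the solution
    print(f"iteration {it}: residual norm = {np.linalg.norm(r):.3e}")
```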

SLIDE 42

Collaborators and Support

MAGMA team: http://icl.cs.utk.edu/magma
PLASMA team: http://icl.cs.utk.edu/plasma

Collaborating partners:
  • University of Tennessee, Knoxville
  • Lawrence Livermore National Laboratory, Livermore, CA
  • University of California, Berkeley
  • University of Colorado, Denver
  • INRIA, France (StarPU team)
  • KAUST, Saudi Arabia

SLIDE 43

ACM: The Learning Continues…

  • Questions about this webcast? learning@acm.org
  • ACM Learning Webinars (on-demand archive): http://webinar.acm.org/
  • ACM Learning Center: http://learning.acm.org
  • ACM SIGHPC: http://www.sighpc.org/