Analyzing the Effect of Different Programming Models Upon - PowerPoint PPT Presentation

Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms
Hongzhang Shan, Berkeley Lab/NERSC
Haoqiang Jin, NASA Ames Research Center
Karl Fuerlinger, U.C. Berkeley
Alice Koniges, Nicholas J. Wright, NERSC

Despite continued “packing” of transistors, performance is flatlining
• New Constraints
  – 15 years of exponential clock rate growth has ended
• But Moore’s Law continues!
  – How do we use all of those transistors to keep performance increasing at historical rates?
  – Industry Response: #cores per chip doubles every 18 months instead of clock frequency!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
CUG 2010

Computer Centers and Vendors are Responding with Multi-core Designs
[Figure: node block diagram showing Greyhound cores, 6MB L3 caches, DDR3 channels, and HT3 links, with HT1 / HT3 to the interconnect]
• Baker Node Details: 24-core “Magny Cours”
• 2 Multi-Chip Modules, 4 Opteron Dies
• 8 Channels of DDR3 Bandwidth to 8 DIMMs
• 24 (or 16) Computational Cores, 24 MB of L3 cache

What’s Wrong with MPI with Multi-core
• We can run 1 MPI process per core (flat model for parallelism)
  – This works now and will work for a while
  – But this is wasteful of intra-chip latency and bandwidth (100x lower latency and 100x higher bandwidth on chip than off-chip)
  – Model has diverged from reality (the machine is NOT flat)
• How long will it continue working?
  – 4 - 8 cores? Probably. 128 - 1024 cores? Probably not.
  – Depends on performance expectations
• What is the problem?
  – Latency: some copying required by semantics
  – Memory utilization: partitioning data for separate address spaces requires some replication
    · How big is your per-core subgrid? At 10x10x10, about 1/2 of the points (488 of 1000) are surface points, probably replicated
  – Memory bandwidth: extra state means extra bandwidth
  – Weak scaling: success model for the “cluster era”; will not be for the many-core era, since there is not enough memory per core
  – Heterogeneity: MPI per CUDA thread-block?
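
The subgrid arithmetic in the memory-utilization bullet can be checked with a short sketch. It assumes a cubic n x n x n block per core and a one-cell ghost (halo) layer on every face. The function names are illustrative and not from the slides. For a 10x10x10 block, the surface layer holds 488 of 1000 points (just under half), and a one-cell ghost layer adds 728 points on top of the 1000 owned ones.

```python
def surface_fraction(n: int) -> float:
    """Fraction of points in an n x n x n subgrid that lie on its surface."""
    if n < 1:
        raise ValueError("n must be at least 1")
    interior = max(n - 2, 0) ** 3      # points not touching any face
    return (n ** 3 - interior) / n ** 3


def ghost_overhead(n: int, halo: int = 1) -> float:
    """Ghost points added around an n x n x n subgrid, relative to owned points.

    Assumes a halo of the given width on all six faces (edges and corners included).
    """
    if n < 1 or halo < 0:
        raise ValueError("n must be >= 1 and halo must be >= 0")
    return ((n + 2 * halo) ** 3 - n ** 3) / n ** 3


if __name__ == "__main__":
    for n in (10, 20, 50, 100):
        print(n, round(surface_fraction(n), 3), round(ghost_overhead(n), 3))
```

Both ratios shrink as the per-core block grows, which is why a fixed memory budget per core hurts more as core counts rise.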

Changing Programming Models to Accommodate the Multi-core Revolution
• We need to research other programming models and understand their advantages and disadvantages
  – OpenMP
  – UPC
  – Hybrid MPI+OpenMP
  – Etc.
• Our work focuses on the Cray XT5

Outline
• Quantify Memory Usage for Different Programming Models
• Using Detailed Time Breakdown to Investigate Performance Effects of Different Programming Models
• Compare the Performance of Hopper and Jaguar to evaluate the hex-core and quad-core difference
• Conclusion and Future Work

Memory Usage: OpenMP, UPC, MPI
• MPI uses the most memory; UPC uses slightly less
• OpenMP saves a great deal of memory thanks to direct data access

Memory Usage: MPI+OpenMP
• Using more OpenMP threads can reduce memory usage substantially, by up to a factor of five on Hopper (eight-core nodes)
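
To make the trade-off concrete, the sketch below enumerates the ways one node's cores can be split between MPI tasks and OpenMP threads. It then applies a toy memory model in which each MPI task carries its own replicated data on top of data that is distributed across the node. The model and the numbers fed to it are hypothetical illustrations, not measurements from the slides.

```python
def node_layouts(cores_per_node: int) -> list[tuple[int, int]]:
    """All (mpi_tasks, omp_threads) pairs that use every core of one node."""
    return [
        (cores_per_node // threads, threads)
        for threads in range(1, cores_per_node + 1)
        if cores_per_node % threads == 0
    ]


def node_memory_mb(mpi_tasks: int, distributed_mb: float, replicated_mb_per_task: float) -> float:
    """Toy model: distributed data is stored once, replicated data once per MPI task."""
    return distributed_mb + mpi_tasks * replicated_mb_per_task


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend.
    for tasks, threads in node_layouts(8):
        print(f"{tasks} MPI x {threads} OpenMP: {node_memory_mb(tasks, 400, 100):.0f} MB")
```

In this model the saving from adding threads depends entirely on how much of the footprint is per-task replication, so the measured factor will differ from benchmark to benchmark.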

Performance: Using One Node on Hopper
• Similar performance for CG, EP, LU, MG
• For FT, IS, OpenMP delivers significantly better performance due to efficient programming

Performance: MPI vs. UPC
• UPC performs better than MPI for EP and IS, about the same for CG, and worse for the others

Time Breakdown: MPI vs. UPC
• For LU, the longer communication time for UPC is probably due to the lack of efficient point-to-point synchronization
• For IS, the one-sided upc_memget/upc_memput is much more efficient than the MPI_Alltoallv function

Performance: BT-MZ (MPI+OpenMP)
• MPI suffers from load imbalance at higher numbers of MPI tasks
• Best performance obtained when OpenMP=2
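
One way to see why more MPI tasks can worsen load balance when zones differ in size is the sketch below. It assigns zones to ranks with a greedy largest-first heuristic and reports the imbalance as maximum load over mean load. The zone sizes and the heuristic are assumptions made for illustration; they are not the BT-MZ zone sizes or the benchmark's own balancing scheme.

```python
import heapq


def assign_zones(zone_sizes: list[int], n_ranks: int) -> list[int]:
    """Greedy largest-first assignment; returns the total load on each rank."""
    loads = [0] * n_ranks
    heap = [(0, rank) for rank in range(n_ranks)]   # (current load, rank)
    heapq.heapify(heap)
    for size in sorted(zone_sizes, reverse=True):
        load, rank = heapq.heappop(heap)            # least-loaded rank so far
        loads[rank] = load + size
        heapq.heappush(heap, (loads[rank], rank))
    return loads


def imbalance(loads: list[int]) -> float:
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    zones = [8, 4, 2, 2]                            # hypothetical zone sizes
    for ranks in (2, 4):
        print(ranks, imbalance(assign_zones(zones, ranks)))
```

With these sizes, two ranks balance perfectly while four ranks leave one rank with twice the mean load. Keeping the MPI task count low and filling the remaining cores with OpenMP threads sidesteps that effect.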

Performance: SP-MZ (MPI+OpenMP)
• Time is dominated by OpenMP
• Performance scales well
• Best performance obtained when OpenMP=2

Performance: LU-MZ (MPI+OpenMP)
• Best performance obtained when OpenMP=8, due to larger cache size and enough work in the OpenMP region to amortize the OpenMP overhead

Jaguar vs. Hopper: Single Node
• Using 8 cores on Jaguar delivers similar performance
• Using 12 cores on Jaguar:
  – EP is 1.6 times better due to higher CPU frequency
  – CG, IS perform better due to larger aggregate cache size
  – MG, SP perform worse due to memory contention

Jaguar vs. Hopper: MPI Across Nodes
• EP, a computation-intensive application, is consistently better
• IS performs worst due to higher communication contention

Jaguar vs. Hopper: Time Breakdown for MPI on 1024 Cores
• Computation time is similar between Jaguar and Hopper
• Communication time is higher on Jaguar, except for EP

Jaguar vs. Hopper: Hybrid (MPI+OpenMP)
• For BT-MZ, similar performance
• For SP-MZ, Jaguar is worse due to higher network contention
• Using more OpenMP threads could reduce the performance gap

Conclusion and Future Work (1)
• Memory Usage
  – MPI consumes the most, UPC slightly less, and OpenMP saves a great deal
  – Using more OpenMP threads can reduce memory usage several-fold for the MPI+OpenMP hybrid model
• Performance
  – On a single node, OpenMP performs best due to its efficient programming and direct data access
  – Across nodes, UPC currently performs slightly worse overall, but delivers much better performance for IS, the communication-intensive application
  – Hybrid MPI+OpenMP codes favor using more OpenMP threads; the best performance depends on the tradeoff between OpenMP overhead and larger cache effects

Conclusion and Future Work (2)
• Jaguar vs. Hopper
  – Using hex-core nodes may cause more memory contention, slowing performance
  – Using hex-core nodes may cause more network contention, degrading performance and hurting scalability
  – Hex-core favors only computation-intensive applications, such as EP
• Future Work
  – Examine larger node architectures
  – New Programming Models or MPI + x or ?
