SLIDE 1

A Micro-Benchmark Evaluation of Catamount and Cray Linux Environment (CLE) Performance

Jeff Larkin, Cray Inc. <larkin@cray.com>
Jeff Kuehn, ORNL <kuehn@ornl.gov>

SLIDE 2

THE BIG QUESTION!

Does CLE waddle like a penguin, or run like a catamount?

SLIDE 3

Overview

• Background
  • Motivation
  • Catamount and CLE
  • Benchmarks
  • Benchmark System
• Benchmark Results
  • HPCC
  • IMB
• Conclusions

SLIDE 4

BACKGROUND

SLIDE 5

Motivation

• Last year at CUG, “CNL” was in its infancy. Since CUG07:
  • Significant effort has been spent scaling on large machines
  • CNL reached GA status in Fall 2007
  • Compute Node Linux (CNL) was renamed the Cray Linux Environment (CLE)
  • A significant number of sites have already made the change
  • Many codes have already been ported from Catamount to CLE
• Catamount’s scalability has always been touted, so how does CLE compare?
  • Fundamentals of communication performance: HPCC, IMB
• What should sites/users know before they switch?

SLIDE 6

Background: Catamount

• Developed by Sandia for Red Storm
• Adopted by Cray for the XT3
• Extremely lightweight:
  • Simple memory model: no virtual memory, no mmap
  • Reduced set of system calls
  • Single threaded
  • No Unix sockets
  • No dynamic libraries
  • Few interrupts to user code
• Virtual Node (VN) mode added for dual-core

SLIDE 7

Background: CLE

• First, we tried a full SUSE Linux kernel.
• Then, we “put Linux on a diet.”
• With the help of ORNL and NERSC, we began running at large scale.
• By Fall 2007, we released Linux for the compute nodes.
• What did we gain?
  • Threading (see the sketch below)
  • Unix sockets
  • I/O buffering
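One concrete example of the first gain: CLE’s threaded compute-node OS can run hybrid MPI + OpenMP codes, which single-threaded Catamount could not. The sketch below is a minimal, hypothetical illustration (not from the talk); it assumes an MPI library with funneled thread support and an OpenMP-capable compiler.

    /* Hypothetical hybrid MPI + OpenMP "hello": runs on CLE's threaded
     * compute nodes, but could not run under single-threaded Catamount. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request funneled thread support: only the main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Each OpenMP thread reports its place within the MPI rank. */
            printf("rank %d: thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }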

SLIDE 8

Background: Benchmarks

HPCC
• Suite of several benchmarks, released as part of the DARPA HPCS program
• Measures MPI performance, and performance for varied temporal and spatial localities
• Benchmarks are run in 3 modes:
  • SP – one node runs the benchmark
  • EP – every node runs a copy of the same benchmark
  • Global – all nodes run the benchmark together

Intel MPI Benchmarks (IMB) 3.0
• Formerly the Pallas benchmarks
• Benchmarks standard MPI routines at varying scales and message sizes (a ping-pong sketch follows below)
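As a rough illustration of what the simplest IMB test (PingPong) measures, the sketch below times round trips of a fixed-size message between two ranks and reports one-way latency and bandwidth. It is a minimal stand-in, not the IMB source; the message size and repetition count are arbitrary examples.

    /* Minimal ping-pong sketch (illustrative only, not the IMB implementation). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1024;      /* example message size */
        const int reps   = 1000;      /* example repetition count */
        char *buf = malloc(nbytes);
        int rank;
        double t0, one_way;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        one_way = (MPI_Wtime() - t0) / (2.0 * reps);   /* half the round-trip time */

        if (rank == 0)
            printf("latency %.2f usec, bandwidth %.1f MB/s\n",
                   one_way * 1e6, nbytes / one_way / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }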

SLIDE 9

Background: Benchmark System

• All benchmarks were run on the same system, “Shark,” with the latest OS versions as of Spring 2008.
• System basics:
  • Cray XT4
  • 2.6 GHz dual-core Opterons (able to run on up to 1280 cores)
  • DDR2-667 memory, 2 GB/core
• Configurations tested:
  • Catamount (1.5.61)
  • CLE, MPT2 (2.0.50)
  • CLE, MPT3 (2.0.50, xt-mpt 3.0.0.10)

SLIDE 10

BENCHMARK RESULTS

SLIDE 11

HPCC

SLIDE 12

Parallel Transpose (Cores)

[Chart: bandwidth (GB/s) vs. Processor Cores; series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

SLIDE 13

Parallel Transpose (Sockets)

[Chart: bandwidth (GB/s) vs. Sockets; series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

SLIDE 14

MPI Random Access

[Chart: GUP/s vs. Processor Cores; series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

SLIDE 15

MPI-FFT (cores)

[Chart: GFlops/s vs. Processor Cores; series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

SLIDE 16

MPI-FFT (Sockets)

[Chart: GFlops/s vs. Sockets; series: Catamount SN, Catamount VN, CLE MPT2 N1, CLE MPT2 N2, CLE MPT3 N1, CLE MPT3 N2]

SLIDE 17

Naturally Ordered Latency

Naturally ordered ring latency at a 512-core scale, Time (usec):
  Catamount SN   6.41346
  CLE MPT2 N1    9.08375
  CLE MPT3 N1    9.41753
  Catamount VN   12.3024
  CLE MPT2 N2    13.8044
  CLE MPT3 N2    9.799

SLIDE 18

Naturally Ordered Bandwidth

Naturally ordered ring bandwidth at a 512-core scale (MB/s):
  Catamount SN   1.07688
  CLE MPT2 N1    0.900693
  CLE MPT3 N1    0.81866
  Catamount VN   0.171141
  CLE MPT2 N2    0.197301
  CLE MPT3 N2    0.329071

SLIDE 19

IMB

SLIDE 20

IMB Ping Pong Latency (N1)

[Chart: Time (usec) vs. Message Size (B); series: Catamount, CLE MPT2, CLE MPT3]

SLIDE 21

IMB Ping Pong Latency (N2)

[Chart: average Time (usec) vs. Message Size (Bytes); series: Catamount, CLE MPT2, CLE MPT3]

SLIDE 22

IMB Ping Pong Bandwidth

[Chart: bandwidth (MB/s) vs. Message Size (Bytes); series: Catamount, CLE MPT2, CLE MPT3]

SLIDE 23

MPI Barrier (Lin/Lin)

[Chart: Time (usec) vs. Processor Cores, linear scales; series: Catamount, CLE MPT2, CLE MPT3]
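For reference, the barrier numbers on this and the next two slides come from IMB's Barrier test, which (roughly speaking) averages the cost of repeated MPI_Barrier calls. A minimal, hypothetical version of that measurement loop (not the IMB source) would look like this:

    /* Rough sketch of a barrier timing loop (illustrative, not the IMB source). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;   /* example repetition count */
        int rank, size;
        double t0, per_call;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Barrier(MPI_COMM_WORLD);          /* warm up and synchronize the start */
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        per_call = (MPI_Wtime() - t0) / reps;

        if (rank == 0)
            printf("%d ranks: %.2f usec per barrier\n", size, per_call * 1e6);

        MPI_Finalize();
        return 0;
    }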

SLIDE 24

MPI Barrier (Lin/Log)

[Chart: Time (usec, linear) vs. Processor Cores (log scale); series: Catamount, CLE MPT2, CLE MPT3]

SLIDE 25

MPI Barrier (Log/Log)

[Chart: Time (usec, log scale) vs. Processor Cores (log scale); series: Catamount, CLE MPT2, CLE MPT3]

SLIDE 26

SendRecv (Catamount/CLE MPT2)

SLIDE 27

SendRecv (Catamount/CLE MPT3)

SLIDE 28

Broadcast (Catamount/CLE MPT2)

SLIDE 29

Broadcast (Catamount/CLE MPT3)

SLIDE 30

Allreduce (Catamount/CLE MPT2)

SLIDE 31

Allreduce (Catamount/CLE MPT3)

SLIDE 32

AlltoAll (Catamount/CLE MPT2)

SLIDE 33

AlltoAll (Catamount/CLE MPT3)

SLIDE 34

CONCLUSIONS

SLIDE 35

What we saw

• Catamount
  • Handles single-core (SN/N1) runs slightly better
  • Seems to handle small messages and small core counts slightly better
• CLE
  • Does very well on dual-core
  • Likes large messages and large core counts
• MPT3 helps performance and closes the gap between QK (Catamount) and CLE

SLIDE 36

What’s left to do?

• We’d really like to try this again on a larger machine.
  • Does CLE continue to beat Catamount above 1024 cores, or will the lines converge or cross?
• What about I/O?
  • Linux adds I/O buffering; how does this affect I/O performance at scale?
• How does this translate into application performance?
  • See “Cray XT4 Quadcore: A First Look,” Richard Barrett, et al., Oak Ridge National Laboratory (ORNL)

SLIDE 37

CLE RUNS LIKE A BIG CAT!

Does CLE waddle like a penguin, or run like a catamount?

SLIDE 38

Acknowledgements

This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Thanks to Steve, Norm, Howard, and others for help investigating and understanding these results.
