

  1. Introducing the Cray XMT
     Petr Konecny, November 29th 2007

  2. Agenda
     ▪ Shared memory programming model
       • Benefits / challenges / solutions
     ▪ Origins of the Cray XMT
     ▪ Cray XMT system architecture
       • Cray XT infrastructure
       • Cray Threadstorm processor
     ▪ Basic programming environment features
     ▪ Examples
       • HPCC Random Access
       • Breadth-first search
     ▪ Rules of thumb
     ▪ Summary

  3. Shared memory model
     ▪ Benefits
       • Uniform memory access
       • Memory is distributed across all nodes
       • No (need for) explicit message passing
       • Productivity advantage over MPI
     ▪ Challenges
       • Latency: time for a single operation
       • Network bandwidth limits performance
       • Legacy MPI codes

  4. Addressing shared memory challenges
     ▪ Latency
       • Little's law:
         - Parallelism is necessary!
         - Concurrency = Bandwidth × Latency
         - e.g. 800 MB/s and 2 µs latency => 200 concurrent 64-bit word ops
       • Need a lot of concurrency to maximize bandwidth
         - Concurrency per thread (ILP, vector, SSE) => SPMD
         - Many threads (MTA, XMT) => MPMD
     ▪ Network bandwidth
       • Provision lots of bandwidth
         - ~1 GB/s per processor, ~5 GB/s per router on the XMT
       • Efficient for small messages
       • Software-controlled caching (registers, nearby memory)
         - Eliminates cache coherency traffic
         - Reduces network bandwidth
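
The 200-operation figure above is just Little's law applied to the example numbers. A minimal C sketch of that arithmetic, not from the deck, with the 800 MB/s and 2 µs values hard-coded purely for illustration:

    #include <stdio.h>

    /* Little's law: concurrency needed to keep a link busy = bandwidth * latency.
     * The numbers are the slide's example; the program itself is illustrative. */
    int main(void) {
        double bandwidth = 800e6;   /* bytes per second (800 MB/s) */
        double latency   = 2e-6;    /* seconds (2 microseconds)    */
        double bytes_in_flight = bandwidth * latency;      /* 1600 bytes   */
        double words_in_flight = bytes_in_flight / 8.0;    /* 64-bit words */
        printf("concurrent 64-bit ops needed: %.0f\n", words_in_flight);   /* 200 */
        return 0;
    }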

  5. Origins of the Cray XMT
     ▪ Cray XMT (a.k.a. Eldorado): Cray XT infrastructure with the Opteron upgraded to Threadstorm
     ▪ From the Multithreaded Architecture (MTA)
       • Shared memory programming model
       • Thread-level parallelism
       • Lightweight synchronization
     ▪ From the Cray XT infrastructure
       • Scalability
       • I/O, HSS, support
       • Network efficient for small messages

  6. Cray XMT system architecture
     ▪ Compute partition
       • Compute PEs run MTK (BSD)
     ▪ Service partition
       • Linux OS on specialized Linux service & I/O nodes
       • Login PEs, IO server PEs, network server PEs, FS metadata server PEs, database server PEs
       • I/O via PCI-X: 10 GigE network connections and Fibre Channel to RAID controllers

  7. Cray XMT speeds and feeds (per Threadstorm processor)
     ▪ ASIC: 500M instructions/s
     ▪ Cache: 66M cache lines/s, 500M memory ops/s
     ▪ DDR DRAM (4, 8 or 16 GB): 500M memory ops/s
     ▪ Network: 140M memory ops/s; sustained rate drops from 110M to 30M memory ops/s as the system grows from 1 to 4K processors (bisection bandwidth impact)

  8. Cray Threadstorm architecture
     ▪ Streams (128 per processor)
       • Registers, program counter, other state
     ▪ Protection domains (16 per processor)
       • Provide an address space
       • Each running stream belongs to exactly one protection domain
     ▪ Functional units
       • Memory
       • Arithmetic
       • Control
     ▪ Memory buffer (cache)
       • Stores only data from the DIMMs attached to the processor
       • Never caches remote data (no coherency traffic)
       • All requests go through the buffer
       • 128 KB, 4-way associative, 64-byte cache lines

  9. XMT programming environment supports multithreading
     ▪ Flat distributed shared memory!
     ▪ Rely on the parallelizing compilers
       • They do great with loop-level parallelism
     ▪ Many computations need to be restructured
       • To expose parallelism
       • For thread safety
     ▪ Lightweight threading
       • Full/empty bit on every word
         - writeef / readfe / readff / writeff
       • Compact thread state
       • Low thread overhead
       • Low synchronization overhead
       • Futures (see LISP)
     ▪ Performance tools
       • Apprentice2 - parses compiler annotations, visualizes runtime behavior
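
The full/empty generics named above (readfe, writeef, readff, writeff) are how the programming environment exposes the per-word synchronization bits. A hedged sketch, not taken from the deck, of the usual acquire/update/release pattern; the function and variable names are invented for illustration:

    /* readfe() waits until the word is full, returns its value and leaves it
     * empty; writeef() waits until the word is empty, stores a value and marks
     * it full. Together they act as a per-word lock around the update. */
    void atomic_add(int *counter, int delta) {
        int old = readfe(counter);        /* full -> empty: acquire and read  */
        writeef(counter, old + delta);    /* empty -> full: write and release */
    }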

  10. HPCC Random Access
      ▪ Update a large table based on a random number generator
      ▪ NEXTRND returns the next value of the RNG

          unsigned rnd = 1;
          for (i = 0; i < NUPDATE; i++) {
              rnd = NEXTRND(rnd);
              Table[rnd & (size-1)] ^= rnd;
          }

      ▪ HPCC_starts(k) returns the k-th value of the RNG

          for (i = 0; i < NUPDATE; i++) {
              unsigned rnd = HPCC_starts(i);
              Table[rnd & (size-1)] ^= rnd;
          }

      ▪ The compiler can automatically parallelize this loop
      ▪ It generates readfe / writeef for atomicity
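
The slide leaves NEXTRND undefined. For reference, here is a serial C sketch of the same update loop with NEXTRND written out as the standard HPCC RandomAccess 64-bit LFSR step; the 64-bit types and the power-of-two table size are assumptions carried over from the benchmark, not stated on the slide:

    #include <stdint.h>

    /* Standard HPCC RandomAccess generator step (not shown on the slide). */
    #define POLY 0x0000000000000007ULL
    #define NEXTRND(x) (((x) << 1) ^ (((int64_t)(x) < 0) ? POLY : 0))

    /* Serial reference version of the update loop above. */
    void random_access_serial(uint64_t *Table, uint64_t size, uint64_t nupdate) {
        uint64_t rnd = 1;
        for (uint64_t i = 0; i < nupdate; i++) {
            rnd = NEXTRND(rnd);
            Table[rnd & (size - 1)] ^= rnd;   /* size must be a power of two */
        }
    }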

  11. HPCC Random Access - tuning
      ▪ HPCC_starts is expensive
      ▪ Restructure the loop to amortize its cost

          for (i = 0; i < NUPDATE; i += bigstep) {
              unsigned v = HPCC_starts(i);
              for (j = 0; j < bigstep; j++) {
                  v = NEXTRND(v);
                  Table[v & (size-1)] ^= v;
              }
          }

      ▪ The compiler parallelizes the outer loop across all processors
      ▪ Apprentice2 reports
        • Five instructions per update (includes NEXTRND)
        • Two (synchronized) memory operations per update
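
Written out with declarations so it compiles, the restructured loop looks roughly as follows. This is a sketch, not the tuned benchmark source: HPCC_starts is assumed to be the benchmark's usual skip-ahead routine returning the i-th random value, and bigstep is assumed to divide nupdate evenly.

    /* Blocked version: one expensive HPCC_starts call per block of updates.
     * On the XMT the compiler parallelizes the outer loop and emits
     * readfe/writeef for the table updates, as described on the slide. */
    extern uint64_t HPCC_starts(uint64_t i);   /* assumed benchmark routine */

    void random_access_blocked(uint64_t *Table, uint64_t size,
                               uint64_t nupdate, uint64_t bigstep) {
        for (uint64_t i = 0; i < nupdate; i += bigstep) {
            uint64_t v = HPCC_starts(i);
            for (uint64_t j = 0; j < bigstep; j++) {
                v = NEXTRND(v);
                Table[v & (size - 1)] ^= v;
            }
        }
    }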

  12. HPCC Random Access - performance
      ▪ Performance analysis
        • Each update requires a read from and a write to a DIMM
        • Peak of 66M cache lines/s per processor
        • => peak of 33M updates/s per processor
      ▪ Single-processor performance
        • Measured 20.9M updates/s
      ▪ On a 64-CPU preproduction system
        • Measured 1.28 GUP/s
        • 95% scaling efficiency from 1P to 64P

  13. Breadth-first search
      ▪ Algorithm to find the shortest-path tree in an unweighted graph

          Parent[*] = null
          Enqueue(source)
          Parent[source] = source
          While queue not empty:
              For all u already in queue:
                  Dequeue(u)
                  For all neighbors v of u:
                      If Parent[v] is null:
                          Parent[v] = u
                          Enqueue(v)
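
As a concrete reference point, the same algorithm in plain serial C, assuming a CSR-style graph (offsets[] / edges[]) and a preallocated queue of size |V|, with parent[v] == -1 playing the role of "null". These data-structure choices are illustrative, not from the deck.

    #include <stdint.h>

    void bfs_serial(int64_t nv, const int64_t *offsets, const int64_t *edges,
                    int64_t source, int64_t *parent, int64_t *queue) {
        for (int64_t v = 0; v < nv; v++) parent[v] = -1;   /* parent[*] = null */
        int64_t head = 0, tail = 0;
        queue[tail++] = source;                            /* enqueue(source)  */
        parent[source] = source;
        while (head < tail) {                              /* queue not empty  */
            int64_t u = queue[head++];                     /* dequeue(u)       */
            for (int64_t e = offsets[u]; e < offsets[u + 1]; e++) {
                int64_t v = edges[e];                      /* neighbor v of u  */
                if (parent[v] == -1) {
                    parent[v] = u;
                    queue[tail++] = v;                     /* enqueue(v)       */
                }
            }
        }
    }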

  14. Breadth-first search
      ▪ An algorithm to find the shortest-path tree in an unweighted graph

          parent[*] = null                      ← parallel
          enqueue(source)
          parent[source] = source
          while queue not empty:                ← serial
              for all u already in queue:       ← parallel
                  dequeue(u)
                  for all neighbors v of u:     ← possibly parallel
                      if parent[v] is null:     ← atomic (readfe)
                          parent[v] = u         ← writeef
                          enqueue(v)

  15. Breadth-first search - queue
      ▪ Each vertex can be enqueued at most once
      ▪ Use an array of size |V| with head and tail pointers

          oldtail = tail;
          oldhead = head;
          head = tail;
          #pragma mta assert parallel
          for (int i = oldhead; i < oldtail; i++) {
              Node u = Queue[i];
              ...
          }
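
Putting the queue scheme together with the synchronization annotated on slide 14, one level of the parallel expansion might look like the hedged sketch below. It reuses the CSR layout assumed earlier; readfe/writeef lock one parent[] word at a time, and int_fetch_add (the XMT atomic fetch-and-add generic) claims queue slots. Bumping the shared tail for every discovered vertex is the simple approach; the next slide notes that the tuned code eliminates that contention.

    void bfs_expand_level(const int64_t *offsets, const int64_t *edges,
                          int64_t *parent, int64_t *Queue,
                          int64_t oldhead, int64_t oldtail, int64_t *tail) {
        #pragma mta assert parallel
        for (int64_t i = oldhead; i < oldtail; i++) {
            int64_t u = Queue[i];
            for (int64_t e = offsets[u]; e < offsets[u + 1]; e++) {
                int64_t v = edges[e];
                int64_t p = readfe(&parent[v]);          /* full -> empty: acquire   */
                if (p == -1) {
                    writeef(&parent[v], u);              /* claim v and release      */
                    Queue[int_fetch_add(tail, 1)] = v;   /* enqueue for next level   */
                } else {
                    writeef(&parent[v], p);              /* already claimed: restore */
                }
            }
        }
    }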

  16. Breadth-first search - tuning and performance
      ▪ Tune on sparse Erdős-Rényi graphs
      ▪ Reduce the overhead of queue operations
      ▪ Eliminate contention for the queue tail pointer
      ▪ Performance counters show:
        • 2 memory operations per edge
        • 8.45 memory operations per vertex
      ▪ 32-processor system
        • 1 billion nodes / 10 billion edges: ~17 s
      ▪ 128-processor system
        • 4 billion nodes / 40 billion edges: ~20 s

  17. Performance - rules of thumb
      ▪ Instructions are cheap compared to memory ops
      ▪ Most workloads will be limited by bandwidth
      ▪ Keep enough memory operations in flight at all times
      ▪ Load balancing
      ▪ Minimize synchronization
      ▪ Use moderately cache-friendly algorithms
      ▪ Cache hits are not necessary to hide latency
      ▪ Cache can improve effective bandwidth
      ▪ ~40% cache hit rate for distributed memory
      ▪ ~80% cache hit rate for nearby memory
      ▪ Reduce cache footprint
      ▪ Be careful about speculative loads (bandwidth is scarce)
      ▪ Think of the XMT as a lot of processors running at 1 MHz

  18. Traits of strong Cray XMT applications
      1. Use lots of memory
         • Cray XMT supports terabytes
      2. Lots of parallelism
         • Amdahl's law
         • Parallelizing compiler
      3. Fine granularity of memory access
         • Network is efficient for all (including short) packets
      4. Data hard to partition
         • Uniform shared memory alleviates the need to partition
      5. Difficult load balancing
         • Uniform shared memory enables work migration

  19. Summary
      ▪ Shared memory programming is good for productivity
      ▪ Cray XMT adds value for an important class of problems
        • Terabytes of memory
        • Irregular access with small granularity
        • Lots of parallelism exploitable by the programming environment
      ▪ Working on scaling the system

  20. Future example: tree search
      ▪ Sequential version

          struct Tree {
              Tree *llink;
              Tree *rlink;
              int data;
          };

          int search_tree(Tree *root, int target) {
              int sum = 0;
              if (root) {
                  sum = (root->data == target ? 1 : 0);
                  sum += search_tree(root->rlink, target);
                  sum += search_tree(root->llink, target);
              }
              return sum;
          }

      ▪ Future version: all loads of a future variable are readff(), all stores are writeff()

          int search_tree(Tree *root, int target) {
              int sum = 0;
              if (root) {
                  /* Declare a future variable. */
                  future int left$;
                  /* Create a continuation based on the future variable left$.
                     Set left$ to empty. */
                  future left$(root, target) {
                      /* Return the result in the future variable left$.
                         Set left$ to full. */
                      return search_tree(root->llink, target);
                  }
                  sum = (root->data == target ? 1 : 0);
                  sum += search_tree(root->rlink, target);
                  /* Wait for left$ to be full before adding it to the sum. */
                  sum += left$;
              }
              return sum;
          }
