
Impact of Traditional Sparse Optimizations on a Migratory Thread Architecture
Thomas B. Rolinger, Christopher D. Krieger
SC 2018

Outline
1. Motivation
2. Emu Architecture
3. SpMV Optimizations
4. Experiments and Results
5. Conclusions & Future Work

1.) Motivation


1.) Motivation
• Sparse linear algebra kernels
  – Present in many scientific/big-data applications
  – Achieving high performance is difficult
    • irregular access patterns and weak locality
  – Most approaches target today’s architectures: deep-memory hierarchies, GPUs, etc.
• Novel architectures for sparse applications
  – Emu: light-weight migratory threads, narrow memory, near-memory processing
• Our work
  – Study impact of existing optimizations for sparse algorithms on Emu versus cache-memory based systems
  – Target algorithm: Sparse Matrix-Vector Multiply (SpMV)
    • Compressed Sparse Row (CSR)
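For reference, the target kernel can be sketched as follows. This is a minimal serial illustration of SpMV over a CSR matrix, not the Emu implementation from the talk; the names `spmv_csr`, `row_ptr`, `col_idx`, and `vals` are assumptions chosen for this example.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute b = A * x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i;
    col_idx[k] and vals[k] give the column and value of non-zero k.
    (All names here are illustrative assumptions.)
    """
    n_rows = len(row_ptr) - 1
    b = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0  # accumulate the row's dot product locally
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]  # one load from x per non-zero
        b[i] = acc  # single write to b per row
    return b


# 3x3 example matrix:
#   [10  0  2]
#   [ 0  3  0]
#   [ 1  0  4]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [10.0, 2.0, 3.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0]
b = spmv_csr(row_ptr, col_idx, vals, x)
```

The indirect access `x[col_idx[k]]` is the source of the irregular access pattern mentioned above: which element of x is read depends on the sparsity structure of the matrix.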

2.) Emu Architecture


2.) Emu Architecture
• Gossamer Core (GC)
  – general purpose, cache-less
  – supports up to 64 concurrent light-weight threads
• Narrow Memory
  – eight 8-bit channels rather than a single, wider 64-bit interface
• Memory-side Processor
  – executes atomic and remote operations
  – remote ops do not generate migrations
System used in our work:
  – 1 node: 8 nodelets with 1 GC per nodelet (150MHz)
  – 8GB DDR4 1600MHz per nodelet
  – 64 threads per nodelet (512 total)


2.) Emu Architecture: Migrations
1.) Thread on GC issues remote mem access
2.) GC makes request to NQM to migrate thread
3.) Thread moved into migration queue
4.) Thread sent over ME once accepted by NQM
5.) Thread arrives in dest run queue and waits for available register set on a GC
Thread Context: Roughly 200 bytes (PC, registers, stack counter, etc.)
Migration Cost: ~2x more than a local access

3.) SpMV Optimizations

3.) SpMV Optimizations: Vector Data Layout
• Updating b may require remote writes
  – non-zeros on row i are all assigned to a single thread → b[i] accumulated in register and then updated via single remote write (or local write)
• SpMV requires one load from x per non-zero
  – each access may generate migration → layout of x is crucial to performance
• Cyclic and Block layouts
  – Cyclic: adjacent elements of vector are on different nodelets (round-robin) → consecutive accesses require migrations
  – Block: equally divide the vectors into fixed-size blocks and place 1 block on each nodelet
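The difference between the two layouts can be illustrated with a small model. The sketch below is an assumption-laden simplification, not the Emu runtime: it maps each index of x to an owning nodelet and counts a migration whenever consecutive accesses land on different nodelets, assuming the thread stays wherever it last migrated. The function names and the ceiling-based block size are hypothetical.

```python
NUM_NODELETS = 8  # matches the 8-nodelet system described in the slides


def cyclic_owner(i, num_nodelets=NUM_NODELETS):
    """Cyclic layout: element i lives on nodelet i mod num_nodelets."""
    return i % num_nodelets


def block_owner(i, n, num_nodelets=NUM_NODELETS):
    """Block layout: the n elements are split into num_nodelets contiguous blocks."""
    block = -(-n // num_nodelets)  # ceiling division (assumed block size)
    return i // block


def count_migrations(col_indices, owner, start_nodelet=0):
    """Count accesses to x that fall on a different nodelet than the current one.

    Simplified model: only loads from x are considered, and the thread
    remains on the nodelet it most recently migrated to.
    """
    current = start_nodelet
    migrations = 0
    for j in col_indices:
        dest = owner(j)
        if dest != current:
            migrations += 1
            current = dest
    return migrations


n = 64                  # length of x
cols = list(range(8))   # a row whose non-zeros sit in 8 consecutive columns
cyclic_migs = count_migrations(cols, cyclic_owner)
block_migs = count_migrations(cols, lambda j: block_owner(j, n))
```

Under these assumptions, the row with consecutive column indices migrates on nearly every access with the cyclic layout and not at all with the block layout, which is the effect the bullets above describe.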


3.) SpMV Optimizations: Work Distribution
[Figure: assignment of b across NDLT 0 through NDLT 7 for each distribution]
• Row based
  – evenly distribute rows
  – block size of b == # rows per nodelet
  – may assign unequal # of non-zeros to each nodelet
• Non-zero based
  – “evenly” distribute non-zeros
  – may assign unequal # of rows to each nodelet
    • remote writes may be required for b
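The two distributions can be compared on a CSR row pointer array. The following is a rough sketch under stated assumptions: rows are kept whole, the non-zero based split picks the first row boundary whose cumulative non-zero count reaches each target, and the function names are hypothetical. The talk does not specify how its splits were computed.

```python
from bisect import bisect_left


def row_based_split(n_rows, parts):
    """Row boundaries giving each part roughly n_rows / parts rows."""
    return [(p * n_rows) // parts for p in range(parts + 1)]


def nnz_based_split(row_ptr, parts):
    """Row boundaries giving each part roughly nnz / parts non-zeros.

    Rows are never split, so the balance is only approximate.
    """
    nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, parts):
        target = (p * nnz) / parts
        # first row boundary whose cumulative non-zero count reaches the target
        r = bisect_left(row_ptr, target)
        bounds.append(max(r, bounds[-1]))
    bounds.append(len(row_ptr) - 1)
    return bounds


def nnz_per_part(row_ptr, bounds):
    """Number of non-zeros assigned to each part."""
    return [row_ptr[bounds[p + 1]] - row_ptr[bounds[p]]
            for p in range(len(bounds) - 1)]


# 8 rows: one dense row (8 non-zeros) followed by sparse rows
row_ptr = [0, 8, 9, 10, 11, 12, 13, 14, 16]
row_bounds = row_based_split(8, 2)        # 4 rows per part
nnz_bounds = nnz_based_split(row_ptr, 2)  # 8 non-zeros per part
row_nnz = nnz_per_part(row_ptr, row_bounds)
nnz_nnz = nnz_per_part(row_ptr, nnz_bounds)
```

In this example the row based split balances rows but not work, while the non-zero based split balances work at the cost of giving one part a single row and the other seven. With unequal row counts, the rows a part owns no longer line up with a fixed-size block of b, which is why remote writes to b may be needed.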

4.) Experiments and Results

4.) Experiments: Matrices
• Evaluated SpMV across 40 matrices
  – Following results focus on a representative subset
  – RMAT graph produced with a=0.45, b=0.22, c=0.22
  – All matrices are square
  – Non-symmetric denoted with “*”, symmetric matrices stored in their entirety
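For readers unfamiliar with RMAT, the parameters above are quadrant probabilities in a recursive edge generator. The sketch below shows the general technique only; it is not the generator used in the talk. It assumes the fourth probability is d = 1 - a - b - c (about 0.11 here), and the `scale`, `n_edges`, and `seed` parameters are hypothetical.

```python
import random


def rmat_edges(scale, n_edges, a=0.45, b=0.22, c=0.22, seed=0):
    """Generate (row, col) pairs for a 2**scale x 2**scale RMAT matrix.

    At each of `scale` levels, one quadrant is picked with probability
    a (top-left), b (top-right), c (bottom-left), or d (bottom-right),
    where d = 1 - a - b - c is an assumption of this sketch.
    """
    d = 1.0 - a - b - c
    if d < 0:
        raise ValueError("a + b + c must not exceed 1")
    rng = random.Random(seed)
    edges = []
    for _ in range(n_edges):
        row = col = 0
        for _level in range(scale):
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass              # top-left quadrant
            elif r < a + b:
                col |= 1          # top-right quadrant
            elif r < a + b + c:
                row |= 1          # bottom-left quadrant
            else:
                row |= 1          # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges


edges = rmat_edges(scale=4, n_edges=100)
```

Because a is the largest probability, non-zeros cluster toward low row and column indices, giving a skewed distribution of non-zeros per row. Duplicate edges are possible and are not removed in this sketch.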
