Performance understanding — Sara Karamati, Jeff Young, Kenneth (Kent) Czechowski — PowerPoint PPT presentation



SLIDE 1

hpcgarage.org/sppexa16

Performance understanding

Sara Karamati · Jeff Young · Kenneth (Kent) Czechowski [Main Street Hub] · Jee Whan Choi [IBM Research] · Oded Green [Bader lab @ GT] · Marat Dukhan · Anita Zakrzewska [Bader lab @ GT] · Parsa Banihashmi & Ken Wills [Civil Engineering @ GT] · Richard (Rich) Vuduc
January 26, 2016 — CATWALK session at the SPPEXA 2016 Symposium

SLIDE 2

“Life can only be understood backwards; but it must be lived forwards.”

— Søren Kierkegaard

SLIDE 3

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
SLIDE 4

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
  • 2. Transformation(s) can drive performance understanding!
SLIDE 5

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

SLIDE 6

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance

Oded Green, Marat Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15. Plus post-paper analysis help from Anita Zakrzewska.

SLIDE 7

Shiloach-Vishkin algorithm to compute connected components (as labels):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]
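The pseudocode above can be sketched in C as plain sequential label propagation (a minimal sketch with my own names, `sv_label_propagation` and `edge_t`; the SPAA'15 code is parallel and this is not it):

```c
#include <stddef.h>

typedef struct { int u, v; } edge_t;

/* Minimal sequential sketch of Shiloach-Vishkin-style label propagation
 * over an undirected edge list: every vertex starts with its own label,
 * and each sweep pulls the smaller label across every edge until no
 * label changes.  Returns the number of sweeps taken. */
int sv_label_propagation(int *label, int n, const edge_t *edges, size_t m) {
    for (int v = 0; v < n; ++v)
        label[v] = v;                       /* label[v] <- int(v) */
    int sweeps = 0, changed = 1;
    while (changed) {
        changed = 0;
        ++sweeps;
        for (size_t e = 0; e < m; ++e) {
            int a = edges[e].u, b = edges[e].v;
            if (label[b] < label[a]) { label[a] = label[b]; changed = 1; }
            if (label[a] < label[b]) { label[b] = label[a]; changed = 1; }
        }
    }
    return sweeps;
}
```

The inner `if` is exactly the data-dependent branch the next slides measure: whether it is taken depends on the input graph, which is why its misprediction cost varies so much across datasets.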

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
SLIDE 8

[Chart: predicted cycles per instruction (Ivy Bridge) for the branch-based Shiloach-Vishkin (SV) implementation across 11 input graphs (astro-ph, audikw1, auto, coAuthorsDBLP, cond-mat-2003, cond-mat-2005, coPapersDBLP, ecology1, ldoor, power, preferentialAttachment), on a 0.0-1.2 scale; predictors: cache references, cache misses, branches, mispredictions]

SLIDE 9

[Chart: cycles per instruction (Ivy Bridge) for branch-based SV across the same 11 input graphs; this build highlights the measured values]

SLIDE 10

[Chart: cycles per instruction (Ivy Bridge) for branch-based SV across the same 11 input graphs; this build highlights the modeled values, fit from performance counters (cache references, cache misses, branches, mispredictions) via lasso regression]

SLIDE 11

[Chart: repeat of the predicted cycles-per-instruction chart (Ivy Bridge) for branch-based SV across the 11 input graphs]

SLIDE 12

Branch-based (original):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

Branch-avoiding:

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            flag ← (label[u] < label[v])
            cmov(label[v], label[u], flag)
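At the C level, the cmov step can be approximated with a branchless select (a sketch with hypothetical names, not the SPAA'15 source; whether a compiler actually lowers the ternary to `cmov` rather than a branch depends on compiler and flags, which is why the paper controlled this at the assembly level):

```c
/* Branch-avoiding edge relaxation sketch.  The data-dependent branch is
 * replaced by a flag plus a select; note the store to label[v] now
 * happens unconditionally on every edge, trading branch mispredictions
 * for extra (but perfectly predictable) writes. */
static inline int select_min(int a, int b) {
    int flag = (b < a);      /* flag <- (label[u] < label[v]) */
    return flag ? b : a;     /* cmov(label[v], label[u], flag) */
}

void relax_edges_branch_avoiding(int *label, const int *src,
                                 const int *dst, int m) {
    for (int e = 0; e < m; ++e) {
        int v = src[e], u = dst[e];
        label[v] = select_min(label[v], label[u]);
    }
}
```

The design point: the branch-based version does less memory traffic but pays for mispredictions on hard-to-predict inputs; the branch-avoiding version does uniform work per edge.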

SLIDE 13

[Chart: Shiloach-Vishkin connected components, cycles per iteration on Ivy Bridge for ldoor, normalized to the branch-based minimum, over ~60 iterations; the branch-avoiding implementation is up to 1.12x faster than the branch-based one]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
SLIDE 14

Branch-based (original):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

Branch-avoiding:

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            flag ← (label[u] < label[v])
            cmov(label[v], label[u], flag)

(replacing the branch: if label[u] < label[v] then label[v] ← label[u])

SLIDE 15

https://rjlipton.wordpress.com/2015/06/25/we-say-branches-and-you-say-choices/

SLIDE 16

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

SLIDE 17

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

Sara Karamati, Jeff Young (PhD), R. Vuduc

SLIDE 18

Example:

Single-Source Shortest Path (SSSP) on a GPU

Sara Karamati, Jeff Young (PhD), R. Vuduc

SLIDE 19

๏ Consider two implementation variants*

    ๏ “Bellman-Ford-like” — highly parallel but not work-optimal
    ๏ “Delta-stepping-like” — tunable work-parallelism tradeoff
    ๏ No preprocessing shortcuts, à la PHAST**

๏ Both are tuned* for a GPU, and we run them on an NVIDIA Jetson TK1, which has tunable core frequencies (10x) and memory frequencies (3x)

* These are Gunrock implementations of Davidson, Baxter, Garland, and Owens (IPDPS’14)
** Delling et al. “PHAST: Hardware-accelerated shortest path trees” (JPDC’10)
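To make the work-parallelism knob concrete, here is a toy sequential delta-stepping sketch (hypothetical code and names, not the Gunrock GPU implementation). A small `delta` behaves Dijkstra-like, near work-optimal but with little parallelism per bucket; a large `delta` approaches Bellman-Ford, with big buckets of concurrently relaxable vertices but more wasted relaxations:

```c
#include <limits.h>
#include <string.h>

#define NV  5     /* vertices in the toy graph */
#define NB  64    /* bucket array bound (enough for this example) */
#define CAP 64    /* per-bucket capacity, including re-insertions */

typedef struct { int to, w; } arc_t;

/* Toy delta-stepping SSSP: bucket i holds vertices whose tentative
 * distance lies in [i*delta, (i+1)*delta); buckets are settled in
 * order, and a vertex is re-scanned when its distance improves. */
void delta_step_sssp(int dist[NV], int src, int delta,
                     const arc_t *const adj[NV], const int deg[NV]) {
    static int bucket[NB][CAP];
    static int bsz[NB];
    memset(bsz, 0, sizeof bsz);
    for (int v = 0; v < NV; ++v)
        dist[v] = INT_MAX;
    dist[src] = 0;
    bucket[0][bsz[0]++] = src;
    for (int i = 0; i < NB; ++i) {
        for (int k = 0; k < bsz[i]; ++k) {     /* bsz[i] may grow as we relax */
            int v = bucket[i][k];
            if (dist[v] / delta != i)          /* stale entry: skip it */
                continue;
            for (int e = 0; e < deg[v]; ++e) {
                int u = adj[v][e].to;
                int nd = dist[v] + adj[v][e].w;
                if (nd < dist[u]) {
                    dist[u] = nd;
                    int b = nd / delta;
                    if (b < NB)
                        bucket[b][bsz[b]++] = u;
                }
            }
        }
    }
}
```

In a parallel setting, everything inside one bucket can be relaxed concurrently; that bucket width, set by `delta`, is the tunable tradeoff the slide refers to.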

SLIDE 20

[Chart: parallelism (×10³) vs. iteration]

SLIDE 21

SLIDE 22

SLIDE 23

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
  • 2. Transformation(s) can drive performance understanding!
SLIDE 24

vs.

Passive

(observational)

Active

(experimental)

Kent Czechowski

SLIDE 25

vs.

Passive

(observational)

Active

(experimental)

Kent Czechowski

Many related ideas!

Environmental modifiers: DVFS (h/w), Gremlins (s/w)
Code modifiers: autotuning, stochastic (super)optimizers

SLIDE 26

Kent’s idea:

Pressure point analysis (PPA)

Iteratively rewrite the input program in a controlled fashion, then re-analyze it. Rewrites need not necessarily be semantics preserving!

Kent Czechowski

SLIDE 27

PPA: CONCEPTUAL EXAMPLE

Tri-diagonal elimination:

for (i = 1; i < n; i++) {
    x[i] = z[i]*( y[i] - x[i-1] );
}

Original inner-loop code:

vmovsd xmm1, [8+rsi+r12]
vmovsd xmm2, [16+rsi+r12]
vsubsd xmm0, xmm1, xmm0
vmulsd xmm3, xmm0, [8+rsi+rbp]
vmovsd [8+rsi+r13], xmm3
vsubsd xmm4, xmm2, xmm3
vmulsd xmm0, xmm4, [16+rsi+rbp]
vmovsd [16+rsi+r13], xmm0

Compute only (memory operations replaced with nops or register operands):

nop
nop
vsubsd xmm0, xmm1, xmm0
vmulsd xmm3, xmm0, xmm10
nop
vsubsd xmm4, xmm2, xmm3
vmulsd xmm0, xmm4, xmm12
nop

Memory access only (compute operations replaced with nops):

vmovsd xmm1, [8+rsi+r12]
vmovsd xmm2, [16+rsi+r12]
nop
vmovsd xmm3, [8+rsi+rbp]
vmovsd [8+rsi+r13], xmm3
nop
vmovsd xmm0, [16+rsi+rbp]
vmovsd [16+rsi+r13], xmm0

Transformations need not (necessarily) preserve correctness!
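A C-level caricature of the same pressure-point idea (my own sketch; real PPA rewrites the binary's instructions, as on this slide, not the source): strip the kernel down to its compute chain only, or to its memory traffic only, and time each variant against the original.

```c
/* Original tri-diagonal elimination kernel, as on the slide. */
void tridiag(double *x, const double *y, const double *z, int n) {
    for (int i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* "Compute only" variant (sketch): array operands become loop-invariant
 * scalars, so the floating-point dependence chain runs without loads or
 * stores.  Like PPA rewrites, this does NOT preserve semantics; it only
 * isolates how much time the compute chain accounts for. */
double tridiag_compute_only(double y0, double z0, int n) {
    volatile double x = 0.0;   /* volatile: keep the chain from folding away */
    for (int i = 1; i < n; i++)
        x = z0 * (y0 - x);
    return x;
}

/* "Memory access only" variant (sketch): same loads and stores, with the
 * arithmetic removed (the binary-level version uses nops instead). */
void tridiag_memory_only(double *x, const double *y, const double *z, int n) {
    volatile double sink = 0.0;
    for (int i = 1; i < n; i++) {
        sink = y[i];       /* keep the y[i] load */
        sink = x[i - 1];   /* keep the x[i-1] load */
        x[i] = z[i];       /* keep the z[i] load and the x[i] store */
    }
}
```

Comparing the three runtimes suggests whether the kernel is bound by the serial compute chain or by memory, without needing a hardware model of either.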

SLIDE 28

Automated battery of experiments

  • Frontend bottlenecks
  • Scheduling resource conflicts
  • Data bypass delays
  • Cache latency stalls
  • Memory disambiguation conflicts
  • Retirement bandwidth

Can we account for all lost cycles?

for (k = 0; k < n; k++) {
    x[k] = u[k] + r*( z[k] + r*y[k] ) +
           t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
           t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) );
}

Kent’s progress: Micro-op caches, L1 bank conflicts, register scrambling, cache placement

SLIDE 29


SLIDE 30

REGISTER SCRAMBLING

Scramble #1:

inloop:
    movsd  xmm1, [88+r12+r9*8]
    movsd  xmm1, [104+r12+r9*8]
    movsd  xmm2, [120+r12+r9*8]
    movsd  xmm2, [136+r12+r9*8]
    movaps xmm0, [80+r12+r9*8]
    movhpd xmm1, [96+r12+r9*8]
    movaps xmm2, [96+r12+r9*8]
    movhpd xmm3, [112+r12+r9*8]
    movaps xmm1, [112+r12+r9*8]
    movhpd xmm0, [128+r12+r9*8]
    movaps xmm0, [128+r12+r9*8]
    movhpd xmm3, [144+r12+r9*8]
    mulpd  xmm1, xmm1
    mulpd  xmm0, xmm0
    mulpd  xmm1, xmm3
    mulpd  xmm3, xmm3
    mulpd  xmm2, xmm2
    mulpd  xmm3, xmm2
    mulpd  xmm1, xmm3
    mulpd  xmm3, xmm1
    addpd  xmm2, xmm1
    addpd  xmm0, xmm3
    addpd  xmm3, xmm3
    addpd  xmm2, xmm3
    mulpd  xmm0, [r15+r9*8]
    mulpd  xmm0, [16+r15+r9*8]
    mulpd  xmm3, [32+r15+r9*8]
    mulpd  xmm1, [48+r15+r9*8]
    addpd  xmm0, xmm3
    addpd  xmm0, xmm1
    addpd  xmm1, xmm0
    addpd  xmm1, xmm2
    movaps [r11+r9*8], xmm3
    movaps [16+r11+r9*8], xmm0
    movaps [32+r11+r9*8], xmm1
    movaps [48+r11+r9*8], xmm1
    add    r8, 1
    cmp    r8, rbx
    jb     inloop

Scramble #2:

inloop:
    movsd  xmm2, [88+r12+r9*8]
    movsd  xmm0, [104+r12+r9*8]
    movsd  xmm0, [120+r12+r9*8]
    movsd  xmm3, [136+r12+r9*8]
    movaps xmm0, [80+r12+r9*8]
    movhpd xmm3, [96+r12+r9*8]
    movaps xmm0, [96+r12+r9*8]
    movhpd xmm2, [112+r12+r9*8]
    movaps xmm1, [112+r12+r9*8]
    movhpd xmm0, [128+r12+r9*8]
    movaps xmm1, [128+r12+r9*8]
    movhpd xmm0, [144+r12+r9*8]
    mulpd  xmm3, xmm3
    mulpd  xmm3, xmm3
    mulpd  xmm3, xmm2
    mulpd  xmm0, xmm3
    mulpd  xmm2, xmm0
    mulpd  xmm0, xmm3
    mulpd  xmm0, xmm1
    mulpd  xmm1, xmm3
    addpd  xmm1, xmm2
    addpd  xmm1, xmm3
    addpd  xmm3, xmm1
    addpd  xmm2, xmm3
    mulpd  xmm0, [r15+r9*8]
    mulpd  xmm2, [16+r15+r9*8]
    mulpd  xmm1, [32+r15+r9*8]
    mulpd  xmm1, [48+r15+r9*8]
    addpd  xmm3, xmm2
    addpd  xmm1, xmm2
    addpd  xmm2, xmm1
    addpd  xmm2, xmm1
    movaps [r11+r9*8], xmm3
    movaps [16+r11+r9*8], xmm2
    movaps [32+r11+r9*8], xmm1
    movaps [48+r11+r9*8], xmm1
    add    r8, 1
    cmp    r8, rbx
    jb     inloop

Scramble #1: 31.51 cycles per iteration (IPC = 1.24), 37.32 watts average power, 84.0 nJ per iteration
Scramble #2: 19.65 cycles per iteration (IPC = 1.98), 42.02 watts average power, 59.0 nJ per iteration

SLIDE 31

[Chart: average power (watts) vs. IPC for register-scrambled variants of Kernel 1 on Haswell (HSW); linear fit: y = 5.59x + 30.79, R² = 0.99]
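Reading the fit off the chart and plugging in the two scrambles from the previous slide gives a quick sanity check (interpreting the intercept as roughly baseline power and the slope as the activity-dependent part is my gloss, not a claim from the talk):

```c
/* Fitted Haswell power model from the chart: P(IPC) = 5.59*IPC + 30.79.
 * Slope ~5.6 W per unit IPC; intercept ~30.8 W of IPC-independent power. */
double predicted_power_watts(double ipc) {
    return 5.59 * ipc + 30.79;
}
```

For the two scrambles this predicts about 37.7 W at IPC 1.24 (measured: 37.32 W) and about 41.9 W at IPC 1.98 (measured: 42.02 W), so the fit tracks the measurements to within roughly half a watt.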

SLIDE 32

[Chart: improvement in energy efficiency on the Livermore Loops, relative to Penryn (PNY, ’07), for NHM, WSM, SNB, IVB, and HSW (’13), on a 0-3.5x scale; stacked contributions from SIMD extensions, frontend, backend, 22nm process, 32nm process, and the baseline; a 1.5x improvement is called out]

  • K. Czechowski et al. “Improving the energy-efficiency of big cores.” In ISCA’14.


SLIDE 33

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

a) Branch-avoiding algorithms
b) A tunable graph algorithm that reduces power

  • 2. Transformation(s) can drive performance understanding!

Pressure point analysis to pinpoint bottlenecks