hpcgarage.org/sppexa16
Performance understanding
Sara Karamati · Jeff Young · Kenneth (Kent) Czechowski [Main Street Hub] · Jee Whan Choi [IBM Research] · Oded Green [Bader lab @ GT] · Marat Dukhan · Anita Zakrzewska [Bader lab @ GT] · Parsa Banihashmi & Ken Wills [Civil Engineering @ GT] · Richard (Rich) Vuduc
January 26, 2016. CATWALK session at the SPPEXA 2016 Symposium
“Life can only be understood backwards; but it must be lived forwards.” — Søren Kierkegaard
Main ideas of this talk:
1. Performance understanding can drive transformation(s).
2. Transformation(s) can drive performance understanding!
Main ideas of this talk:
1. Performance understanding can drive transformation(s).
   Case study a) Branch-avoidance
   Case study b) A tunable graph algorithm that reduces power
Case study a) Branch-avoidance
Oded Green, Marat Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15. Post-paper analysis help from Anita Zakrzewska.
Shiloach-Vishkin algorithm to compute connected components (as labels)
  forall v ∈ V do
    label[v] ← int(v)
  while … do
    forall v ∈ V do
      forall (v, u) ∈ E do
        if label[u] < label[v] then
          label[v] ← label[u]
O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
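The label-propagation loop can be sketched in C as follows. This is a minimal sequential sketch, not the authors' code: the CSR arrays `row_ptr` and `adj`, the function name, and the "repeat until no label changes" stopping rule (the slide elides the `while` condition) are all assumptions made for illustration. The comparison follows the talk's branch-based listing, `label[u] < label[v]`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical CSR layout: the neighbors of vertex v are
 * adj[row_ptr[v]] .. adj[row_ptr[v + 1] - 1]. */
void sv_branch_based(int n, const int *row_ptr, const int *adj, int *label)
{
    assert(n >= 0);

    /* Every vertex starts in its own component. */
    for (int v = 0; v < n; v++)
        label[v] = v;

    /* Assumed stopping rule: sweep until a full pass changes nothing. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int v = 0; v < n; v++) {
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int u = adj[e];
                if (label[u] < label[v]) {  /* data-dependent branch */
                    label[v] = label[u];    /* adopt the smaller label */
                    changed = true;
                }
            }
        }
    }
}
```

On an undirected graph stored with both edge directions, every vertex ends up labeled with the smallest vertex id in its component.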
[Figure: Cycles per instruction on Ivy Bridge for branch-based SV, one bar group per graph (astro-ph, audikw1, auto, coAuthorsDBLP, cond-mat-2003, cond-mat-2005, coPapersDBLP, ecology1, ldoor, power, preferentialAttachment); y-axis from 0.0 to 1.2. Measured values are shown against modeled values (counters + lasso regression). Predicted values are broken down by cache references, cache misses, branches, and mispredictions.]
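The "counters + lasso regression" model named in the figure can be written in the standard lasso form. This is only a sketch of that general form, not the fitted model from the talk. The symbols are assumptions: $y_i$ is the measured cycles per instruction for sample $i$, $x_{ij}$ is the count of event $j$ (cache references, cache misses, branches, mispredictions) for that sample, assumed here to be normalized per instruction, and $\lambda$ is the regularization weight.

```latex
\begin{aligned}
(\hat{\beta}_0, \hat{\beta}) &= \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2N} \sum_{i=1}^{N} \Bigl( y_i - \beta_0 - \sum_{j=1}^{4} \beta_j x_{ij} \Bigr)^{2}
  + \lambda \sum_{j=1}^{4} \lvert \beta_j \rvert \\
\widehat{\mathrm{CPI}}_i &= \hat{\beta}_0 + \sum_{j=1}^{4} \hat{\beta}_j x_{ij}
\end{aligned}
```

The $\ell_1$ penalty drives some coefficients to exactly zero, so the surviving terms indicate which counters explain the measured cycles.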
Branch-based (original):
  forall v ∈ V do
    label[v] ← int(v)
  while … do
    forall v ∈ V do
      forall (v, u) ∈ E do
        if label[u] < label[v] then
          label[v] ← label[u]

Branch-avoiding:
  forall v ∈ V do
    label[v] ← int(v)
  while … do
    forall v ∈ V do
      forall (v, u) ∈ E do
        flag ← (label[u] < label[v])
        cmov (label[v], label[u], flag)

O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
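In C, the branch-avoiding variant can be approximated with a select expression. This is a hedged sketch under assumptions: the CSR arrays, the function name, and the stopping rule are hypothetical. Whether the compiler actually emits a conditional move for the ternary depends on the compiler, flags, and target, so the generated assembly should be checked.

```c
#include <assert.h>

/* Hypothetical CSR layout: the neighbors of vertex v are
 * adj[row_ptr[v]] .. adj[row_ptr[v + 1] - 1]. */
void sv_branch_avoiding(int n, const int *row_ptr, const int *adj, int *label)
{
    assert(n >= 0);

    for (int v = 0; v < n; v++)
        label[v] = v;

    /* Assumed stopping rule: sweep until a full pass changes nothing. */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < n; v++) {
            int lv = label[v];              /* keep the running minimum in a register */
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int lu = label[adj[e]];
                int flag = (lu < lv);       /* 0 or 1, computed without a jump */
                lv = flag ? lu : lv;        /* select; a candidate for cmov */
                changed |= flag;            /* accumulate instead of branching */
            }
            label[v] = lv;                  /* unconditional store */
        }
    }
}
```

The trade is the one the listing shows: the data-dependent jump disappears, but the comparison, select, and store now execute for every edge, including edges that would not have updated the label.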
[Figure: Shiloach-Vishkin connected components on the ldoor graph, Ivy Bridge. Cycles per iteration, normalized to the branch-based minimum (y-axis from 1.0 to 1.1, with a 1.12x annotation), against iteration (0 to 60), for the branch-based and branch-avoiding implementations.]
O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
Branch-based vs. branch-avoiding, the only difference in the inner loop:
  Branch-based:    if label[u] < label[v] then label[v] ← label[u]
  Branch-avoiding: flag ← (label[u] < label[v]); cmov (label[v], label[u], flag)
O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
https://rjlipton.wordpress.com/2015/06/25/we-say-branches-and-you-say-choices/
Case study b) A tunable graph algorithm that reduces power
Sara Karamati, Jeff Young (PhD), R. Vuduc
Example: Single-Source Shortest Path (SSSP) on a GPU
Sara Karamati, Jeff Young (PhD), R. Vuduc
• Consider two implementation variants *
  • “Bellman-Ford-like”: highly parallel but not work-optimal
  • “Delta-stepping-like”: tunable work-parallelism tradeoff
  • No preprocessing shortcuts, a la PHAST **
• Both are tuned * for a GPU, and we run them on an NVIDIA Jetson TK1, which has tunable core frequencies (10x) and memory frequencies (3x)
* These are Gunrock implementations of Davidson, Baxter, Garland, and Owens (IPDPS’14)
** Delling et al. “PHAST: Hardware-accelerated shortest path trees” (JPDC’10)
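To make "highly parallel but not work-optimal" concrete, here is a sequential C sketch of a Bellman-Ford-like sweep. It is not the Gunrock GPU implementation. The CSR arrays, the function name, and the returned sweep count are hypothetical, and the sketch assumes non-negative integer weights small enough that distances do not overflow.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical CSR layout with one weight per outgoing edge.
 * Returns the number of sweeps performed; dist[] receives the distances. */
int sssp_bellman_ford_like(int n, const int *row_ptr, const int *adj,
                           const int *weight, int source, int *dist)
{
    assert(n > 0 && source >= 0 && source < n);

    for (int v = 0; v < n; v++)
        dist[v] = INT_MAX;              /* INT_MAX marks "not reached yet" */
    dist[source] = 0;

    int sweeps = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        sweeps++;
        /* Each sweep relaxes the edges of every reached vertex. The
         * iterations over v are independent enough to run in parallel,
         * but many relaxations are repeated across sweeps. */
        for (int v = 0; v < n; v++) {
            if (dist[v] == INT_MAX)
                continue;
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int u = adj[e];
                int cand = dist[v] + weight[e];
                if (cand < dist[u]) {
                    dist[u] = cand;
                    changed = 1;
                }
            }
        }
    }
    return sweeps;
}
```

A delta-stepping-like variant would instead group vertices into distance buckets of width Δ and process one bucket at a time, which is where the tunable tradeoff between work and parallelism comes from.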
[Figure: Parallelism (×10^3) vs. iterations]
Main ideas of this talk:
2. Transformation(s) can drive performance understanding!
Kent Czechowski
Passive (observational) vs. active (experimental)
Many related ideas!
Environmental modifiers: DVFS (h/w), Gremlins (s/w)
Code modifiers: autotuning, stochastic (super)optimizers
Kent Czechowski
Kent’s idea: Pressure point analysis (PPA)
Iteratively rewrite the input program in a controlled fashion, then re-analyze it.
Rewrites need not be semantics-preserving!
PPA: Conceptual example

Tri-Diagonal Elimination
  for ( i=1 ; i<n ; i++ ) {
    x[i] = z[i]*( y[i] - x[i-1] );
  }

Original:
  vmovsd xmm1, [8+rsi+r12]
  vmovsd xmm2, [16+rsi+r12]
  vsubsd xmm0, xmm1, xmm0
  vmulsd xmm3, xmm0, [8+rsi+rbp]
  vmovsd [8+rsi+r13], xmm3
  vsubsd xmm4, xmm2, xmm3
  vmulsd xmm0, xmm4, [16+rsi+rbp]
  vmovsd [16+rsi+r13], xmm0

Compute Only:
  nop
  nop
  vsubsd xmm0, xmm1, xmm0
  vmulsd xmm3, xmm0, xmm10
  nop
  vsubsd xmm4, xmm2, xmm3
  vmulsd xmm0, xmm4, xmm12
  nop

Memory Access Only:
  vmovsd xmm1, [8+rsi+r12]
  vmovsd xmm2, [16+rsi+r12]
  nop
  vmovsd xmm3, [8+rsi+rbp]
  vmovsd [8+rsi+r13], xmm3
  nop
  vmovsd xmm0, [16+rsi+rbp]
  vmovsd [16+rsi+r13], xmm0

Transformations need not (necessarily) preserve correctness!
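The same idea can be mimicked at source level. The sketch below is a rough C analogue of the three variants, with hypothetical function names. The talk's example rewrites the instruction stream itself, which a C compiler is free to reorder or simplify, so this only illustrates the intent and is not a replacement for instruction-level rewriting.

```c
#include <assert.h>
#include <stddef.h>

/* Original kernel from the slide: tri-diagonal elimination. */
void tridiag_original(size_t n, double *x, const double *y, const double *z)
{
    assert(n >= 1);
    for (size_t i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* "Compute only": the same dependent subtract/multiply chain, but the
 * operands come from scalars instead of memory. The result is wrong on
 * purpose; only the arithmetic cost is preserved. */
double tridiag_compute_only(size_t n, double x0, double y_val, double z_val)
{
    double acc = x0;
    for (size_t i = 1; i < n; i++)
        acc = z_val * (y_val - acc);
    return acc;
}

/* "Memory access only": the same loads from y and z and the same store
 * to x, with the arithmetic removed. The volatile sink keeps the
 * otherwise unused load of y[i] from being optimized away. */
void tridiag_memory_only(size_t n, double *x, const double *y, const double *z)
{
    volatile double sink = 0.0;
    for (size_t i = 1; i < n; i++) {
        sink = y[i];
        x[i] = z[i];
    }
    (void)sink;
}
```

Timing the three functions on the same `n` gives a first indication of whether the original loop is limited by the dependent arithmetic chain or by its memory traffic.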
Can we account for all lost cycles?
Automated battery of experiments

  for ( k=0 ; k<n ; k++ ) {
    x[k] = u[k] + r*( z[k] + r*y[k] ) +
           t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
           t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) );
  }

• Frontend bottlenecks
• Scheduling resource conflicts
• Data bypass delays
• Cache latency stalls
• Memory disambiguation conflicts
• Retirement bandwidth

Kent’s progress: Micro-op caches, L1 bank conflicts, register scrambling, cache placement