performance understanding



  1. hpcgarage.org/sppexa16 · Performance understanding · Sara Karamati · Jeff Young · Kenneth (Kent) Czechowski [Main Street Hub] · Jee Whan Choi [IBM Research] · Oded Green [Bader lab @ GT] · Marat Dukhan · Anita Zakrzewska [Bader lab @ GT] · Parsa Banihashmi & Ken Wills [Civil Engineering @ GT] · Richard (Rich) Vuduc · January 26, 2016, CATWALK session at the SPPEXA 2016 Symposium

  2. “Life can only be understood backwards; but it must be lived forwards.” — Søren Kierkegaard

  3. Main ideas of this talk: 1. Performance understanding can drive transformation(s).

  4. Main ideas of this talk: 1. Performance understanding can drive transformation(s). 2. Transformation(s) can drive performance understanding!

  5. Main ideas of this talk: 1. Performance understanding can drive transformation(s). Case study a) Branch-avoidance Case study b) A tunable graph algorithm that reduces power

  6. Main ideas of this talk: 1. Performance understanding can drive transformation(s). Case study a) Branch-avoidance: Oded Green, Marat Dukhan, RV. “Branch-avoiding graph algorithms.” In SPAA’15. Plus post-paper analysis help from Anita Zakrzewska.

  7. Shiloach-Vishkin algorithm to compute connected components (as labels)
       forall v ∈ V do
         label[v] ← int(v)
       while … do
         forall v ∈ V do
           forall (v, u) ∈ E do
             if label[v] < label[u] then
               label[v] ← label[u]
     O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

  8. [Figure: Predicted cycles per instruction on Ivy Bridge for branch-based SV, one panel per input graph: astro-ph, audikw1, auto, coAuthorsDBLP, cond-mat-2003, cond-mat-2005, coPapersDBLP, ecology1, ldoor, power, preferentialAttachment. Y-axis from 0.0 to 1.2. Legend (predicted values): Cache.references, Cache.misses, Branches, Mispredictions.]

  9. [Same figure as slide 8, with the annotation “Measured”.]

  10. [Same figure as slide 8, with the annotation “Modeled (counters + lasso regression)”.]

  11. [Same figure as slide 8, without annotations.]

  12. Branch-based (original):
        forall v ∈ V do
          label[v] ← int(v)
        while … do
          forall v ∈ V do
            forall (v, u) ∈ E do
              if label[u] < label[v] then
                label[v] ← label[u]
      Branch-avoiding:
        forall v ∈ V do
          label[v] ← int(v)
        while … do
          forall v ∈ V do
            forall (v, u) ∈ E do
              flag ← (label[u] < label[v])
              cmov (label[v], label[u], flag)
      O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
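The two inner loops on this slide can be sketched in C roughly as follows. This is a minimal sketch, not the code from the SPAA’15 paper: the `csr_graph` layout, the function names, and the mask-based select are assumptions made for illustration. The select stands in for `cmov`; whether a compiler actually emits a conditional move for it is not guaranteed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CSR graph: the neighbors of v are adj[row[v]] .. adj[row[v+1]-1]. */
typedef struct {
    size_t n;             /* number of vertices */
    const size_t *row;    /* n + 1 offsets into adj */
    const uint32_t *adj;  /* concatenated neighbor lists */
} csr_graph;

/* One branch-based sweep over all edges; returns true if any label changed. */
bool sv_sweep_branchy(const csr_graph *g, uint32_t *label) {
    bool changed = false;
    for (size_t v = 0; v < g->n; v++) {
        for (size_t e = g->row[v]; e < g->row[v + 1]; e++) {
            uint32_t lu = label[g->adj[e]];
            if (lu < label[v]) {  /* data-dependent branch */
                label[v] = lu;
                changed = true;
            }
        }
    }
    return changed;
}

/* One branch-avoiding sweep: the comparison becomes a 0/1 flag that
   drives a select, so the inner loop body has no conditional jump. */
bool sv_sweep_branchless(const csr_graph *g, uint32_t *label) {
    uint32_t changed = 0;
    for (size_t v = 0; v < g->n; v++) {
        uint32_t lv = label[v];
        for (size_t e = g->row[v]; e < g->row[v + 1]; e++) {
            uint32_t lu = label[g->adj[e]];
            uint32_t flag = (uint32_t)(lu < lv);  /* 0 or 1 */
            uint32_t mask = 0u - flag;            /* all ones if flag is 1, else 0 */
            lv = (lu & mask) | (lv & ~mask);      /* select lu when flag is set */
            changed |= flag;
        }
        label[v] = lv;
    }
    return changed != 0;
}

/* Driver: initialize label[v] = v, then sweep until no label changes. */
void sv_components(const csr_graph *g, uint32_t *label, bool branchless) {
    assert(g != NULL && label != NULL);
    for (size_t v = 0; v < g->n; v++)
        label[v] = (uint32_t)v;
    while (branchless ? sv_sweep_branchless(g, label)
                      : sv_sweep_branchy(g, label)) {
        /* repeat until a full sweep changes nothing */
    }
}
```

Both variants compute the same labels. The branch-avoiding one does unconditional work per edge (it always performs the select) in exchange for removing a branch whose outcome depends on the data.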

  13. [Figure: Shiloach-Vishkin connected components on the ldoor graph, Ivy Bridge. Cycles per iteration, normalized to the branch-based minimum (y-axis roughly 1.0 to 1.1, with an annotation of 1.12x), versus iteration (0 to 60), for the branch-based and branch-avoiding implementations.] O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

  14. Branch-based (original):
        forall v ∈ V do
          label[v] ← int(v)
        while … do
          forall v ∈ V do
            forall (v, u) ∈ E do
              if label[u] < label[v] then
                label[v] ← label[u]
      Branch-avoiding [on this slide the branch-based inner statement is overlaid on this column]:
        forall v ∈ V do
          label[v] ← int(v)
        while … do
          forall v ∈ V do
            forall (v, u) ∈ E do
              if label[u] < label[v] then
                label[v] ← label[u]
              flag ← (label[u] < label[v])
              cmov (label[v], label[u], flag)
      O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

  15. https://rjlipton.wordpress.com/2015/06/25/we-say-branches-and-you-say-choices/

  16. Main ideas of this talk: 1. Performance understanding can drive transformation(s). Case study a) Branch-avoidance Case study b) A tunable graph algorithm that reduces power

  17. Main ideas of this talk: 1. Performance understanding can drive transformation(s). Case study a) Branch-avoidance Case study b) A tunable graph algorithm that reduces power: Sara Karamati, Jeff Young (PhD), R. Vuduc

  18. Example: Single-Source Shortest Path (SSSP) on a GPU. Sara Karamati, Jeff Young (PhD), R. Vuduc

  19. ๏ Consider two implementation variants *
      ๏ “Bellman-Ford-like”: highly parallel but not work-optimal
      ๏ “Delta-stepping-like”: tunable work-parallelism tradeoff
      ๏ No preprocessing shortcuts, à la PHAST **
      ๏ Both are tuned * for a GPU, and we run them on an NVIDIA Jetson TK1, which has tunable core frequencies (10x) and memory frequencies (3x)
      * These are Gunrock implementations of Davidson, Baxter, Garland, and Owens (IPDPS’14)
      ** Delling et al. “PHAST: Hardware-accelerated shortest path trees” (JPDC’10)
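To make "highly parallel but not work-optimal" concrete, here is a minimal sequential sketch of a Bellman-Ford-like SSSP in C. It is not the Gunrock GPU code the slide refers to: the `edge` type, the function name, and the returned work counter are assumptions for illustration. Every round attempts to relax every edge. Those attempts are independent enough to run in parallel on a GPU, but many of them are wasted, which is the extra work a delta-stepping-like variant trades against parallelism.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SSSP_INF UINT32_MAX  /* marks an unreached vertex */

/* Hypothetical directed, weighted edge. Weights are assumed small
   enough that path lengths do not overflow uint32_t. */
typedef struct {
    uint32_t src, dst, weight;
} edge;

/* Bellman-Ford-like SSSP: sweep all m edges until no distance improves.
   Returns the number of relaxation attempts as a simple work measure. */
size_t sssp_bellman_ford_like(size_t n, const edge *edges, size_t m,
                              uint32_t source, uint32_t *dist) {
    assert(source < n);
    for (size_t v = 0; v < n; v++)
        dist[v] = SSSP_INF;
    dist[source] = 0;

    size_t work = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t e = 0; e < m; e++) {
            work++;  /* counted even when the attempt cannot improve anything */
            uint32_t du = dist[edges[e].src];
            if (du == SSSP_INF)
                continue;  /* source endpoint not reached yet */
            uint32_t cand = du + edges[e].weight;
            if (cand < dist[edges[e].dst]) {
                dist[edges[e].dst] = cand;
                changed = true;
            }
        }
    }
    return work;
}
```

The returned count grows with the number of rounds times the number of edges, regardless of how many attempts actually improve a distance.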

  20. [Figure: parallelism (×10^3) versus iterations.]

  21. Main ideas of this talk: 1. Performance understanding can drive transformation(s). 2. Transformation(s) can drive performance understanding!

  22. Kent Czechowski: Passive (observational) vs. Active (experimental)

  23. Kent Czechowski: Passive (observational) vs. Active (experimental). Many related ideas! Environmental modifiers: DVFS (h/w), Gremlins (s/w). Code modifiers: autotuning, stochastic (super)optimizers.

  24. Kent Czechowski. Kent’s idea: Pressure point analysis (PPA). Iteratively rewrite the input program in a controlled fashion, then re-analyze it. Rewrites need not necessarily be semantics preserving!

  25. PPA: Conceptual example
      Tri-Diagonal Elimination:
        for ( i=1 ; i<n ; i++ ) {
          x[i] = z[i]*( y[i] - x[i-1] );
        }
      Original instruction sequence:
        vmovsd xmm1, [8+rsi+r12]
        vmovsd xmm2, [16+rsi+r12]
        vsubsd xmm0, xmm1, xmm0
        vmulsd xmm3, xmm0, [8+rsi+rbp]
        vmovsd [8+rsi+r13], xmm3
        vsubsd xmm4, xmm2, xmm3
        vmulsd xmm0, xmm4, [16+rsi+rbp]
        vmovsd [16+rsi+r13], xmm0
      Compute Only:
        nop
        nop
        vsubsd xmm0, xmm1, xmm0
        vmulsd xmm3, xmm0, xmm10
        nop
        vsubsd xmm4, xmm2, xmm3
        vmulsd xmm0, xmm4, xmm12
        nop
      Memory Access Only:
        vmovsd xmm1, [8+rsi+r12]
        vmovsd xmm2, [16+rsi+r12]
        nop
        vmovsd xmm3, [8+rsi+rbp]
        vmovsd [8+rsi+r13], xmm3
        nop
        vmovsd xmm0, [16+rsi+rbp]
        vmovsd [16+rsi+r13], xmm0
      Transformations need not (necessarily) preserve correctness!
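A source-level analogue of the three variants on this slide might look like the sketch below. The slide's rewrites are applied to the instruction stream itself. Doing the same in C is only an approximation, since the compiler decides what is finally emitted, and the function names and signatures here are assumptions for illustration. As on the slide, the two derived kernels deliberately do not compute the original result.

```c
#include <assert.h>
#include <stddef.h>

/* Original kernel from the slide: tri-diagonal elimination. */
void tridiag_original(size_t n, double *x, const double *y, const double *z) {
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* Compute-only variant: keeps the dependent subtract/multiply chain but
   takes its operands from scalars instead of memory. The result is
   returned so the chain is not optimized away. */
double tridiag_compute_only(size_t n, double x0, double y0, double z0) {
    double acc = x0;
    for (size_t i = 1; i < n; i++)
        acc = z0 * (y0 - acc);
    return acc;
}

/* Memory-only variant: keeps the loads of y[i] and z[i] and the store
   to x[i], but drops the arithmetic. */
void tridiag_memory_only(size_t n, double *x, const double *y, const double *z) {
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 1; i < n; i++) {
        (void)*(const volatile double *)&y[i];  /* force the load of y[i] */
        x[i] = z[i];                            /* load z[i], store x[i] */
    }
}
```

Timing each variant on the same `n` gives a rough view of how the original kernel's time splits between the arithmetic dependence chain and memory traffic. That is the kind of question the slide's instruction-level rewrites are meant to answer.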

  26. Can we account for all lost cycles? Automated battery of experiments
        for ( k=0 ; k<n ; k++ ) {
          x[k] = u[k] + r*( z[k] + r*y[k] ) +
                 t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
                 t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) );
        }
      • Frontend bottlenecks
      • Scheduling resource conflicts
      • Data bypass delays
      • Cache latency stalls
      • Memory disambiguation conflicts
      • Retirement bandwidth
      Kent’s progress: Micro-op caches, L1 bank conflicts, register scrambling, cache placement

  27. Can we account for all lost cycles? Automated battery of experiments. [Same code and bullet list as slide 26.] Kent’s progress: Micro-op caches, L1 bank conflicts, register scrambling, cache placement
