Performance understanding — Sara Karamati, Jeff Young, Kenneth (Kent) Czechowski — PowerPoint PPT presentation



SLIDE 1

hpcgarage.org/sppexa16

Performance understanding

Sara Karamati · Jeff Young · Kenneth (Kent) Czechowski [Main Street Hub] · Jee Whan Choi [IBM Research] · Oded Green [Bader lab @ GT] · Marat Dukhan · Anita Zakrzewska [Bader lab @ GT] · Parsa Banihashmi & Ken Wills [Civil Engineering @ GT] · Richard (Rich) Vuduc
January 26, 2016 — CATWALK session at the SPPEXA 2016 Symposium

SLIDE 2

“Life can only be understood backwards; but it must be lived forwards.”

— Søren Kierkegaard

SLIDE 3

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
SLIDE 4

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
  • 2. Transformation(s) can drive performance understanding!
SLIDE 5

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

SLIDE 6

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance

Oded Green, Marat Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15. Plus post-paper analysis help from Anita Zakrzewska.

SLIDE 7

Shiloach-Vishkin algorithm to compute connected components (as labels):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]
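The pseudocode above can be sketched in C as plain sequential label propagation (a minimal sketch with my own names, `sv_label_propagation` and `edge_t`; the SPAA'15 code is parallel and this is not it):

```c
#include <stddef.h>

typedef struct { int u, v; } edge_t;

/* Minimal sequential sketch of Shiloach-Vishkin-style label propagation
 * over an undirected edge list: every vertex starts with its own label,
 * and each sweep pulls the smaller label across every edge until no
 * label changes.  Returns the number of sweeps taken. */
int sv_label_propagation(int *label, int n, const edge_t *edges, size_t m) {
    for (int v = 0; v < n; ++v)
        label[v] = v;                       /* label[v] <- int(v) */
    int sweeps = 0, changed = 1;
    while (changed) {
        changed = 0;
        ++sweeps;
        for (size_t e = 0; e < m; ++e) {
            int a = edges[e].u, b = edges[e].v;
            if (label[b] < label[a]) { label[a] = label[b]; changed = 1; }
            if (label[a] < label[b]) { label[b] = label[a]; changed = 1; }
        }
    }
    return sweeps;
}
```

The inner `if` is exactly the data-dependent branch the next slides measure: whether it is taken depends on the input graph, which is why its misprediction cost varies so much across datasets.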

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
SLIDE 8

[Chart: predicted cycles per instruction (Ivy Bridge) for the branch-based Shiloach-Vishkin (SV) implementation across 11 input graphs (astro-ph, audikw1, auto, coAuthorsDBLP, cond-mat-2003, cond-mat-2005, coPapersDBLP, ecology1, ldoor, power, preferentialAttachment), on a 0.0-1.2 scale; predictors: cache references, cache misses, branches, mispredictions]

SLIDE 9

[Chart: cycles per instruction (Ivy Bridge) for branch-based SV across the same 11 input graphs; this build highlights the measured values]

SLIDE 10

[Chart: cycles per instruction (Ivy Bridge) for branch-based SV across the same 11 input graphs; this build highlights the modeled values, fit from performance counters (cache references, cache misses, branches, mispredictions) via lasso regression]

SLIDE 11

[Chart: repeat of the predicted cycles-per-instruction chart (Ivy Bridge) for branch-based SV across the 11 input graphs]

SLIDE 12

Branch-based (original):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

Branch-avoiding:

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            flag ← (label[u] < label[v])
            cmov(label[v], label[u], flag)
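At the C level, the cmov step can be approximated with a branchless select (a sketch with hypothetical names, not the SPAA'15 source; whether a compiler actually lowers the ternary to `cmov` rather than a branch depends on compiler and flags, which is why the paper controlled this at the assembly level):

```c
/* Branch-avoiding edge relaxation sketch.  The data-dependent branch is
 * replaced by a flag plus a select; note the store to label[v] now
 * happens unconditionally on every edge, trading branch mispredictions
 * for extra (but perfectly predictable) writes. */
static inline int select_min(int a, int b) {
    int flag = (b < a);      /* flag <- (label[u] < label[v]) */
    return flag ? b : a;     /* cmov(label[v], label[u], flag) */
}

void relax_edges_branch_avoiding(int *label, const int *src,
                                 const int *dst, int m) {
    for (int e = 0; e < m; ++e) {
        int v = src[e], u = dst[e];
        label[v] = select_min(label[v], label[u]);
    }
}
```

The design point: the branch-based version does less memory traffic but pays for mispredictions on hard-to-predict inputs; the branch-avoiding version does uniform work per edge.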

SLIDE 13

[Chart: Shiloach-Vishkin connected components, cycles per iteration on Ivy Bridge for ldoor, normalized to the branch-based minimum, over ~60 iterations; the branch-avoiding implementation is up to 1.12x faster than the branch-based one]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.
SLIDE 14

Branch-based (original):

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            if label[u] < label[v] then
                label[v] ← label[u]

  • O. Green, M. Dukhan, R. Vuduc. “Branch-avoiding graph algorithms.” In SPAA’15.

Branch-avoiding:

forall v ∈ V do
    label[v] ← int(v)
while … do
    forall v ∈ V do
        forall (v, u) ∈ E do
            flag ← (label[u] < label[v])
            cmov(label[v], label[u], flag)

(replacing the branch: if label[u] < label[v] then label[v] ← label[u])

SLIDE 15

https://rjlipton.wordpress.com/2015/06/25/we-say-branches-and-you-say-choices/

SLIDE 16

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

SLIDE 17

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

Case study a) Branch-avoidance
Case study b) A tunable graph algorithm that reduces power

Sara Karamati, Jeff Young (PhD), R. Vuduc

SLIDE 18

Example:

Single-Source Shortest Path (SSSP) on a GPU

Sara Karamati, Jeff Young (PhD), R. Vuduc

SLIDE 19

๏ Consider two implementation variants*

    ๏ “Bellman-Ford-like” — highly parallel but not work-optimal
    ๏ “Delta-stepping-like” — tunable work-parallelism tradeoff
    ๏ No preprocessing shortcuts, à la PHAST**

๏ Both are tuned* for a GPU, and we run them on an NVIDIA Jetson TK1, which has tunable core frequencies (10x) and memory frequencies (3x)

* These are Gunrock implementations of Davidson, Baxter, Garland, and Owens (IPDPS’14)
** Delling et al. “PHAST: Hardware-accelerated shortest path trees” (JPDC’10)
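To make the work-parallelism knob concrete, here is a toy sequential delta-stepping sketch (hypothetical code and names, not the Gunrock GPU implementation). A small `delta` behaves Dijkstra-like, near work-optimal but with little parallelism per bucket; a large `delta` approaches Bellman-Ford, with big buckets of concurrently relaxable vertices but more wasted relaxations:

```c
#include <limits.h>
#include <string.h>

#define NV  5     /* vertices in the toy graph */
#define NB  64    /* bucket array bound (enough for this example) */
#define CAP 64    /* per-bucket capacity, including re-insertions */

typedef struct { int to, w; } arc_t;

/* Toy delta-stepping SSSP: bucket i holds vertices whose tentative
 * distance lies in [i*delta, (i+1)*delta); buckets are settled in
 * order, and a vertex is re-scanned when its distance improves. */
void delta_step_sssp(int dist[NV], int src, int delta,
                     const arc_t *const adj[NV], const int deg[NV]) {
    static int bucket[NB][CAP];
    static int bsz[NB];
    memset(bsz, 0, sizeof bsz);
    for (int v = 0; v < NV; ++v)
        dist[v] = INT_MAX;
    dist[src] = 0;
    bucket[0][bsz[0]++] = src;
    for (int i = 0; i < NB; ++i) {
        for (int k = 0; k < bsz[i]; ++k) {     /* bsz[i] may grow as we relax */
            int v = bucket[i][k];
            if (dist[v] / delta != i)          /* stale entry: skip it */
                continue;
            for (int e = 0; e < deg[v]; ++e) {
                int u = adj[v][e].to;
                int nd = dist[v] + adj[v][e].w;
                if (nd < dist[u]) {
                    dist[u] = nd;
                    int b = nd / delta;
                    if (b < NB)
                        bucket[b][bsz[b]++] = u;
                }
            }
        }
    }
}
```

In a parallel setting, everything inside one bucket can be relaxed concurrently; that bucket width, set by `delta`, is the tunable tradeoff the slide refers to.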

SLIDE 20

[Chart: parallelism (×10³) vs. iteration]

SLIDE 21

SLIDE 22

SLIDE 23

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).
  • 2. Transformation(s) can drive performance understanding!
SLIDE 24

vs.

Passive

(observational)

Active

(experimental)

Kent Czechowski

SLIDE 25

vs.

Passive

(observational)

Active

(experimental)

Kent Czechowski

Many related ideas!

Environmental modifiers: DVFS (h/w), Gremlins (s/w)
Code modifiers: autotuning, stochastic (super)optimizers

SLIDE 26

Kent’s idea:

Pressure point analysis (PPA)

Iteratively rewrite the input program in a controlled fashion, then re-analyze it. Rewrites need not necessarily be semantics preserving!

Kent Czechowski

SLIDE 27

PPA: CONCEPTUAL EXAMPLE

Tri-diagonal elimination:

for (i = 1; i < n; i++) {
    x[i] = z[i]*( y[i] - x[i-1] );
}

Original inner-loop code:

vmovsd xmm1, [8+rsi+r12]
vmovsd xmm2, [16+rsi+r12]
vsubsd xmm0, xmm1, xmm0
vmulsd xmm3, xmm0, [8+rsi+rbp]
vmovsd [8+rsi+r13], xmm3
vsubsd xmm4, xmm2, xmm3
vmulsd xmm0, xmm4, [16+rsi+rbp]
vmovsd [16+rsi+r13], xmm0

Compute only (memory operations replaced with nops or register operands):

nop
nop
vsubsd xmm0, xmm1, xmm0
vmulsd xmm3, xmm0, xmm10
nop
vsubsd xmm4, xmm2, xmm3
vmulsd xmm0, xmm4, xmm12
nop

Memory access only (compute operations replaced with nops):

vmovsd xmm1, [8+rsi+r12]
vmovsd xmm2, [16+rsi+r12]
nop
vmovsd xmm3, [8+rsi+rbp]
vmovsd [8+rsi+r13], xmm3
nop
vmovsd xmm0, [16+rsi+rbp]
vmovsd [16+rsi+r13], xmm0

Transformations need not (necessarily) preserve correctness!
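A C-level caricature of the same pressure-point idea (my own sketch; real PPA rewrites the binary's instructions, as on this slide, not the source): strip the kernel down to its compute chain only, or to its memory traffic only, and time each variant against the original.

```c
/* Original tri-diagonal elimination kernel, as on the slide. */
void tridiag(double *x, const double *y, const double *z, int n) {
    for (int i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}

/* "Compute only" variant (sketch): array operands become loop-invariant
 * scalars, so the floating-point dependence chain runs without loads or
 * stores.  Like PPA rewrites, this does NOT preserve semantics; it only
 * isolates how much time the compute chain accounts for. */
double tridiag_compute_only(double y0, double z0, int n) {
    volatile double x = 0.0;   /* volatile: keep the chain from folding away */
    for (int i = 1; i < n; i++)
        x = z0 * (y0 - x);
    return x;
}

/* "Memory access only" variant (sketch): same loads and stores, with the
 * arithmetic removed (the binary-level version uses nops instead). */
void tridiag_memory_only(double *x, const double *y, const double *z, int n) {
    volatile double sink = 0.0;
    for (int i = 1; i < n; i++) {
        sink = y[i];       /* keep the y[i] load */
        sink = x[i - 1];   /* keep the x[i-1] load */
        x[i] = z[i];       /* keep the z[i] load and the x[i] store */
    }
}
```

Comparing the three runtimes suggests whether the kernel is bound by the serial compute chain or by memory, without needing a hardware model of either.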

SLIDE 28

Automated battery of experiments

  • Frontend bottlenecks
  • Scheduling resource conflicts
  • Data bypass delays
  • Cache latency stalls
  • Memory disambiguation conflicts
  • Retirement bandwidth

Can we account for all lost cycles?

for (k = 0; k < n; k++) {
    x[k] = u[k] + r*( z[k] + r*y[k] ) +
           t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
           t*( u[k+6] + r*( u[k+5] + r*u[k+4] ) ) );
}

Kent’s progress: Micro-op caches, L1 bank conflicts, register scrambling, cache placement

SLIDE 29


SLIDE 30

REGISTER SCRAMBLING

Scramble #1:

inloop:
    movsd  xmm1, [88+r12+r9*8]
    movsd  xmm1, [104+r12+r9*8]
    movsd  xmm2, [120+r12+r9*8]
    movsd  xmm2, [136+r12+r9*8]
    movaps xmm0, [80+r12+r9*8]
    movhpd xmm1, [96+r12+r9*8]
    movaps xmm2, [96+r12+r9*8]
    movhpd xmm3, [112+r12+r9*8]
    movaps xmm1, [112+r12+r9*8]
    movhpd xmm0, [128+r12+r9*8]
    movaps xmm0, [128+r12+r9*8]
    movhpd xmm3, [144+r12+r9*8]
    mulpd  xmm1, xmm1
    mulpd  xmm0, xmm0
    mulpd  xmm1, xmm3
    mulpd  xmm3, xmm3
    mulpd  xmm2, xmm2
    mulpd  xmm3, xmm2
    mulpd  xmm1, xmm3
    mulpd  xmm3, xmm1
    addpd  xmm2, xmm1
    addpd  xmm0, xmm3
    addpd  xmm3, xmm3
    addpd  xmm2, xmm3
    mulpd  xmm0, [r15+r9*8]
    mulpd  xmm0, [16+r15+r9*8]
    mulpd  xmm3, [32+r15+r9*8]
    mulpd  xmm1, [48+r15+r9*8]
    addpd  xmm0, xmm3
    addpd  xmm0, xmm1
    addpd  xmm1, xmm0
    addpd  xmm1, xmm2
    movaps [r11+r9*8], xmm3
    movaps [16+r11+r9*8], xmm0
    movaps [32+r11+r9*8], xmm1
    movaps [48+r11+r9*8], xmm1
    add    r8, 1
    cmp    r8, rbx
    jb     inloop

Scramble #2:

inloop:
    movsd  xmm2, [88+r12+r9*8]
    movsd  xmm0, [104+r12+r9*8]
    movsd  xmm0, [120+r12+r9*8]
    movsd  xmm3, [136+r12+r9*8]
    movaps xmm0, [80+r12+r9*8]
    movhpd xmm3, [96+r12+r9*8]
    movaps xmm0, [96+r12+r9*8]
    movhpd xmm2, [112+r12+r9*8]
    movaps xmm1, [112+r12+r9*8]
    movhpd xmm0, [128+r12+r9*8]
    movaps xmm1, [128+r12+r9*8]
    movhpd xmm0, [144+r12+r9*8]
    mulpd  xmm3, xmm3
    mulpd  xmm3, xmm3
    mulpd  xmm3, xmm2
    mulpd  xmm0, xmm3
    mulpd  xmm2, xmm0
    mulpd  xmm0, xmm3
    mulpd  xmm0, xmm1
    mulpd  xmm1, xmm3
    addpd  xmm1, xmm2
    addpd  xmm1, xmm3
    addpd  xmm3, xmm1
    addpd  xmm2, xmm3
    mulpd  xmm0, [r15+r9*8]
    mulpd  xmm2, [16+r15+r9*8]
    mulpd  xmm1, [32+r15+r9*8]
    mulpd  xmm1, [48+r15+r9*8]
    addpd  xmm3, xmm2
    addpd  xmm1, xmm2
    addpd  xmm2, xmm1
    addpd  xmm2, xmm1
    movaps [r11+r9*8], xmm3
    movaps [16+r11+r9*8], xmm2
    movaps [32+r11+r9*8], xmm1
    movaps [48+r11+r9*8], xmm1
    add    r8, 1
    cmp    r8, rbx
    jb     inloop

Scramble #1: 31.51 cycles per iteration (IPC = 1.24), 37.32 watts average power, 84.0 nJ per iteration
Scramble #2: 19.65 cycles per iteration (IPC = 1.98), 42.02 watts average power, 59.0 nJ per iteration

SLIDE 31

[Chart: average power (watts) vs. IPC for register-scrambled variants of Kernel 1 on Haswell (HSW); linear fit: y = 5.59x + 30.79, R² = 0.99]
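Reading the fit off the chart and plugging in the two scrambles from the previous slide gives a quick sanity check (interpreting the intercept as roughly baseline power and the slope as the activity-dependent part is my gloss, not a claim from the talk):

```c
/* Fitted Haswell power model from the chart: P(IPC) = 5.59*IPC + 30.79.
 * Slope ~5.6 W per unit IPC; intercept ~30.8 W of IPC-independent power. */
double predicted_power_watts(double ipc) {
    return 5.59 * ipc + 30.79;
}
```

For the two scrambles this predicts about 37.7 W at IPC 1.24 (measured: 37.32 W) and about 41.9 W at IPC 1.98 (measured: 42.02 W), so the fit tracks the measurements to within roughly half a watt.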

SLIDE 32

[Chart: improvement in energy efficiency on the Livermore Loops, relative to Penryn (PNY, ’07), for NHM, WSM, SNB, IVB, and HSW (’13), on a 0-3.5x scale; stacked contributions from SIMD extensions, frontend, backend, 22nm process, 32nm process, and the baseline; a 1.5x improvement is called out]

  • K. Czechowski et al. “Improving the energy-efficiency of big cores.” In ISCA’14.


SLIDE 33

Main ideas of this talk:

  • 1. Performance understanding can drive transformation(s).

a) Branch-avoiding algorithms
b) A tunable graph algorithm that reduces power

  • 2. Transformation(s) can drive performance understanding!

Pressure point analysis to pinpoint bottlenecks