Exploiting Modern Hardware Features via Lightweight Profiling



slide-1
SLIDE 1

Exploiting Modern Hardware Features via Lightweight Profiling

Probir Roy

Scalable Tools Workshop’19

1

slide-8
SLIDE 8

High performance and challenges

Exploiting Modern Hardware Features via Lightweight Profiling

Amazon CloudFront

NAMD

MPI

IBM POWER 9 CPU

Common goal: performance

Variable characteristics of hardware and software are a challenge

Deep insights

2

slide-12
SLIDE 12

High performance and challenges

Exploiting Modern Hardware Features via Lightweight Profiling

  • Peak FLOPS per socket increasing at 50%-60% per year
  • Memory bandwidth increasing at ~23% per year
  • Memory latency increasing at ~4% per year

http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/

3

slide-21
SLIDE 21

Steps of performance analysis

Exploiting Modern Hardware Features via Lightweight Profiling

Application → Profiler (simulation or measurement) → Profiles → Analyzer → Code optimization

What? Why? How?

4

slide-26
SLIDE 26

Limitations of performance analysis (Cont.)

  • Simulation methods (PinTool, GPGPUSim, GEMS): deep insight, high overhead
  • Measurement methods (Perf, Oprofile, PAPI): low overhead, shallow insight
  • Goal: low overhead and deep insight

Reported overheads:

  • Cache simulation: average 38x (Xiang et al., A higher order theory of locality)
  • Selective instrumentation: 2x - 5x (Rane et al., MACPO)
  • Profiling: < 10% (Liu et al., A Data-centric Profiler for Parallel Programs)

Exploiting Modern Hardware Features via Lightweight Profiling 5

slide-29
SLIDE 29

Research statement

Exploiting Modern Hardware Features via Lightweight Profiling

Lightweight profiling with PMUs can provide deep insights into performance issues caused by memory hierarchies and poor algorithm choice

Tools to detect memory and computational inefficiency

[Figure: insight (shallow to deep) versus overhead (low to high) for simulation methods (PinTool, GPGPUSim, GEMS) and measurement methods (Perf, Oprofile, PAPI)]

6

slide-36
SLIDE 36

My research at a glance

Exploiting Modern Hardware Features via Lightweight Profiling

CPU

Cache

Physical Memory

Simultaneous multi-threading

Memory contention

Set-associative cache Conflict miss Cache line utilization

Non-uniform memory Scalability

Memory Inefficiency Programming model

Charm++

7

slide-37
SLIDE 37

Outline

  • Lightweight profiling
  • SMT-aware optimization
  • Detection of cache conflicts
  • Guiding data-structure layout transformation

Exploiting Modern Hardware Features via Lightweight Profiling 8

slide-42
SLIDE 42

Lightweight memory profiling

  • Hardware profiling with the PMU
  • Event-based sampling
  • Intel: Precise Event-Based Sampling (PEBS)
  • AMD: Instruction-Based Sampling (IBS)
  • IBM: Marked Event Sampling (MRK)

[Figure: the PMU takes samples at intervals along the application's timeline. Each sample records the reference type (L1 miss, L2 hit, etc.), the data address, and the instruction pointer.]

9
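As an illustration of the sampling setup described on this slide, the sketch below fills a Linux `struct perf_event_attr` that requests address samples for L1 data-cache read misses. The talk does not show how its tools configure the PMU, so the event choice, the `precise_ip` level, the sampling period, and the function name `init_l1_miss_sampling` are all assumptions. Whether the kernel accepts this combination and backs it with PEBS depends on the processor and kernel version.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fill `attr` so that it asks for one sample every `period` L1D read misses.
 * Hypothetical helper: the event, period and precision level are
 * illustrative choices, not values taken from the talk. */
void init_l1_miss_sampling(struct perf_event_attr *attr, uint64_t period)
{
    assert(attr != NULL && period > 0);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);

    /* Generic cache event encoding: cache id | (op << 8) | (result << 16). */
    attr->type = PERF_TYPE_HW_CACHE;
    attr->config = PERF_COUNT_HW_CACHE_L1D |
                   (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                   (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

    /* Each sample carries the instruction pointer, thread id and data address. */
    attr->sample_period = period;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR;

    attr->precise_ip = 2;      /* request precise attribution of the sample */
    attr->exclude_kernel = 1;  /* user-space accesses only */
    attr->disabled = 1;        /* start disabled; enable explicitly later */
}
```

A profiler would pass the filled structure to the `perf_event_open` system call and read the samples from the ring buffer that the kernel maps for the returned file descriptor; that part is omitted here.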

slide-43
SLIDE 43

Outline

  ✓ Lightweight profiling
  ✓ SMT-aware optimization

  • Detection of cache conflicts
  • Guiding data-structure layout transformation

Exploiting Modern Hardware Features via Lightweight Profiling 10

slide-44
SLIDE 44

SMT-Aware Instantaneous Footprint Optimization

Probir Roy, Shuaiwen Leon Song, Xu Liu [HPDC – 2016]

Exploiting Modern Hardware Features via Lightweight Profiling 11

slide-45
SLIDE 45

SMT (Simultaneous Multi-Threading)

[Figure: use of clock cycles on a superscalar core versus a 2-way SMT core. Legend: Thread 1, Thread 2, idle cycle.]

Exploiting Modern Hardware Features via Lightweight Profiling 12

slide-46
SLIDE 46

SMT scalability

Runtime ratio = SMT runtime / non-SMT runtime

Shared memory SPMD application

[Figure: runtime ratio (lower is better)]

13

slide-47
SLIDE 47

SMT architecture: shared cache

Exploiting Modern Hardware Features via Lightweight Profiling

Thread 1 Thread 2 Core 1 Thread 3 Thread 4 Core 1

L1 Cache L1 Cache L2 Cache LLC Cache

14

slide-48
SLIDE 48

SMT: Memory scalability

SMT scaling factor (F) = access latency with SMT / access latency without SMT

[Figure: SMT scaling factor (lower is better)]

15

slide-50
SLIDE 50

Characterization based on sensitivity

  • (L, F) = (high, high): potentially sensitive to mem-centric SMT optimizations
    Benchmarks: srad, streamcluster1, Lulesh2.0, IRSmk, LU, 3D tensor, Stencil, streamcluster2, hotspot, Clomp
  • (L, F) = (high, low): not clear if they can further benefit from SMT optimizations
    Benchmarks: lud, needle, bfs, nn, bp, canneal, Ferret
  • (L, F) = (low, high): little benefit from mem-centric SMT optimization
    Benchmarks: leukocyte, heartwall, pathfinder, myocyte
  • (L, F) = (low, low): good memory performance with SMT enabled
    Benchmarks: b+tree, cfd, kmeans, lavaMD, particle filter, hotspot3D, blackscholes, bodytrack, facesim, Swaptions

L = memory access latency; F = SMT scaling factor

Exploiting Modern Hardware Features via Lightweight Profiling 16
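The (L, F) characterization can be sketched in a few lines of C. The slides do not give the cut-offs that separate "high" from "low", so the two thresholds are parameters here and any values passed to them are hypothetical. The type and function names are also made up for this example.

```c
#include <assert.h>

/* The four quadrants of the (L, F) characterization. */
typedef enum {
    SENSITIVE_TO_MEM_OPT,    /* (high, high) */
    UNCLEAR_BENEFIT,         /* (high, low)  */
    LITTLE_BENEFIT,          /* (low, high)  */
    GOOD_MEMORY_PERFORMANCE  /* (low, low)   */
} smt_class;

/* F = access latency with SMT / access latency without SMT. */
double smt_scaling_factor(double latency_smt, double latency_non_smt)
{
    assert(latency_non_smt > 0.0);
    return latency_smt / latency_non_smt;
}

/* Classify a benchmark from its memory access latency L and scaling factor F.
 * The thresholds are assumptions supplied by the caller. */
smt_class classify(double latency, double factor,
                   double latency_threshold, double factor_threshold)
{
    int high_l = latency > latency_threshold;
    int high_f = factor > factor_threshold;

    if (high_l && high_f) return SENSITIVE_TO_MEM_OPT;
    if (high_l)           return UNCLEAR_BENEFIT;
    if (high_f)           return LITTLE_BENEFIT;
    return GOOD_MEMORY_PERFORMANCE;
}
```

For instance, a benchmark whose average access latency doubles when SMT is enabled has F = 2 and lands in one of the two "high F" quadrants, depending on its absolute latency.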

slide-58
SLIDE 58

Source of memory contention

Exploiting Modern Hardware Features via Lightweight Profiling

Little/no locality

  • Intra-thread: [Figure: accesses of SMT thread 1 and SMT thread 2 over time and space]
    Optimization: improve cache line utilization
  • Inter-thread: [Figure: accesses over time and space, with both threads touching cache line 1]
    Optimization: collaboration

17

slide-63
SLIDE 63

SMT locality (Stencil code)

Exploiting Modern Hardware Features via Lightweight Profiling

#pragma omp parallel for
for (int i=T; i<N-T; i++)
  for (int j=T; j<N-T; j++)
    for (int k=0; k<T; k++)
      R[i][j] = matrix[i][j] + matrix[i-k][j] + matrix[i][j-k]
              + matrix[i+k][j] + matrix[i][j+k];

Optimization: add schedule(static,1) to the pragma

[Figure: iterations assigned to Thread 1 and Thread 2 before and after the change]

18
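To see what `schedule(static,1)` changes, it helps to compute which thread owns each iteration of the outer loop. The sketch below does that for a normalized iteration number (`i - T` in the stencil). The round-robin rule for `schedule(static, chunk)` follows the OpenMP specification. The contiguous-block rule for plain `schedule(static)` is an assumption, since the specification leaves the exact split to the implementation. Both function names are hypothetical.

```c
#include <assert.h>

/* Thread that runs iteration `iter` under schedule(static, chunk):
 * chunks of `chunk` iterations are handed to threads round-robin. */
int owner_static_chunk(int iter, int chunk, int nthreads)
{
    assert(iter >= 0 && chunk > 0 && nthreads > 0);
    return (iter / chunk) % nthreads;
}

/* Thread that runs iteration `iter` under plain schedule(static), ASSUMING
 * contiguous blocks of ceil(n / nthreads) iterations per thread. */
int owner_static_block(int iter, int n, int nthreads)
{
    assert(n > 0 && nthreads > 0 && iter >= 0 && iter < n);
    int block = (n + nthreads - 1) / nthreads;
    return iter / block;
}
```

With two threads and eight iterations, the block split gives thread 0 iterations 0 to 3 and thread 1 iterations 4 to 7, while `schedule(static,1)` alternates the threads over neighbouring iterations. If the two threads are SMT siblings on one core, neighbouring rows of `matrix` are then touched by threads that share the same caches; thread placement itself is outside this sketch.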

slide-67
SLIDE 67

SMT-Analyzer: Analyzing memory access pattern

Exploiting Modern Hardware Features via Lightweight Profiling

[Figure: memory range accessed by each thread (T1 to T5) inside a loop. The PMU takes samples (S) along the application's timeline; each sample records the reference type, data address, instruction pointer, and thread.]

19

slide-72
SLIDE 72

Benchmarks

Benchmark       Bottleneck region          % of total latency  Overhead  OPT method    Speedup
lulesh2.0       lulesh.cc:604-609          3.6%                +3.1%     inter-thread  1.43×
IRSmk           rmatmult3.c:86-103         78.6%               +3.2%     intra-thread  4.86×
needle          needle.cpp:185-187         20%                 +2.99%    inter-thread  2.37×
srad            srad.cpp:136-167           80.1%               +2.47%    intra-thread  1.74×
LU              rhs.f:318-328              8.4%                +10.6%    inter-thread  1.36×
Stencil         stencil.c:16-21            95.7%               +1.55%    inter-thread  10.9×
3D tensor       mt.c:22-22                 69.4%               +2.4%     inter-thread  1.44×
streamcluster2  streamcluster.cpp:653      14.1%               +15.2%    inter-thread  6.72×

Related work: MACPO (selective instrumentation): 2x - 5x

20

slide-73
SLIDE 73

Outline

  ✓ Lightweight profiling
  ✓ SMT-aware optimization

  • Detection of cache conflicts
  • Guiding data-structure layout transformation

Exploiting Modern Hardware Features via Lightweight Profiling 21

slide-74
SLIDE 74

Lightweight Detection of Cache Conflicts

Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Xu Liu

[CGO – 2018]

Exploiting Modern Hardware Features via Lightweight Profiling 22

slide-78
SLIDE 78

Set-associative cache

Exploiting Modern Hardware Features via Lightweight Profiling

Intel Skylake L1 cache: 32 KB, 8-way set associative, 64 sets (Set 0, Set 1, ..., Set 63)

Each way of a set holds one cache line

64-bit address: TAG | SET index | Offset

23
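The cache geometry on this slide fixes the address split. A 32 KB cache with 8 ways and 64 sets has 32768 / (8 × 64) = 64-byte lines, so the offset takes 6 bits, the set index the next 6 bits, and the tag the rest. The sketch below spells this out, assuming the set index is taken directly from the address bits above the line offset. The constant and function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CACHE_BYTES = 32 * 1024,
    WAYS        = 8,
    SETS        = 64,
    LINE_BYTES  = CACHE_BYTES / (WAYS * SETS), /* 64-byte cache line */
    OFFSET_BITS = 6,                           /* log2(LINE_BYTES) */
    INDEX_BITS  = 6                            /* log2(SETS) */
};

/* Byte position inside the cache line. */
uint64_t line_offset(uint64_t addr) { return addr & (LINE_BYTES - 1); }

/* Cache set selected by the address. */
uint64_t set_index(uint64_t addr) { return (addr >> OFFSET_BITS) & (SETS - 1); }

/* Remaining high bits, compared against the tags stored in the set. */
uint64_t tag_bits(uint64_t addr) { return addr >> (OFFSET_BITS + INDEX_BITS); }
```

Two addresses that are a multiple of 64 × 64 = 4096 bytes apart have the same set index and differ only in the tag, which is the situation that leads to set conflicts.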

slide-87
SLIDE 87

Set conflict

Exploiting Modern Hardware Features via Lightweight Profiling

double Array[20000][128];

[Figure: rows [0] to [20,000] of the array, each holding elements [0] to [127], drawn against the cache sets they map to, before and after padding each row]

Set sequence over time without padding: 16 32 48 16 32 48

Set sequence over time after padding: 8 17 34 51 4 21 48 55

Is your application suffering from conflict cache misses?

24
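The set sequences on this slide can be reproduced with a little arithmetic. A row of 128 doubles occupies 1024 bytes, or 16 cache lines, so walking down column 0 advances the set index by 16 per row and only 4 of the 64 sets are ever used. The sketch below assumes 64-byte lines, 64 sets, an array that starts at address 0, and a pad of 8 doubles (one cache line) per row. The slide does not state the pad size; one line is chosen because it gives the values 17, 34, 51, 4, 21 that appear there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { L1_LINE_BYTES = 64, L1_SETS = 64 };

/* Set touched by element [row][0] of a row-major array of doubles whose
 * rows hold `cols` elements, assuming the array starts at address 0. */
unsigned set_of_row_start(size_t row, size_t cols)
{
    uint64_t addr = (uint64_t)row * cols * sizeof(double);
    return (unsigned)((addr / L1_LINE_BYTES) % L1_SETS);
}

/* Number of distinct sets touched by walking column 0 over `rows` rows. */
unsigned distinct_sets(size_t rows, size_t cols)
{
    unsigned char seen[L1_SETS] = {0};
    unsigned count = 0;

    for (size_t r = 0; r < rows; r++) {
        unsigned s = set_of_row_start(r, cols);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    return count;
}
```

Calling `distinct_sets(20000, 128)` models the unpadded array and `distinct_sets(20000, 136)` the padded one. Under the stated assumptions a padded row spans 17 cache lines, and because 17 and 64 share no common factor the column walk spreads over every set.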

slide-93
SLIDE 93

L1 cache conflict miss

Simulation methods: Memory trace → Cache simulator → Classifying miss

Trace-driven cache simulation

Memory trace over time: A[0][0] A[1][0] A[2][0] A[0][0] A[3][0] A[2][0] … A[0][0]

  • High overhead: average 38 times
  • Difficult to simulate hardware

Theoretically accurate, difficult in practice

Xiang, Xiaoya, Chen Ding, Hao Luo, and Bin Bao. "HOTL: a higher order theory of locality." ACM SIGPLAN Notices 48, no. 4 (2013): 343-356.

25

slide-98
SLIDE 98

Exploiting Modern Hardware Features via Lightweight Profiling

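CCProf: a practical low overhead solution

  • Simulation methods: Memory trace → Cache simulator → Classifying miss
  • Measurement methods (CCProf): Memory sampling → Statistical analysis → Classifying miss

Overhead: simulation methods >> measurement methods

Accuracy: simulation methods ~ measurement methods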

26

slide-103
SLIDE 103

Hardware-based address sampling (Cont.)

Memory references over time, with the L1 outcome of each:

A[0][0]  A[1][0]  A[4][0]  A[0][0]  A[1][0]  A[2][0]  …  A[0][0]
L1 Miss  L1 Miss  L1 Miss  L1 Hit   L1 Hit   L1 Miss     L1 Miss

Precise event sampling (PEBS) in the PMU captures a subset of the references: A[0][0], A[4][0], A[2][0]

Each sampled address splits into TAG | SET index | Offset, giving sets S0, S4, S2

Set conflict?

27

slide-104
SLIDE 104

Observation: temporal pattern of conflict

[Figure: set mapping of Array[20,000][128] before and after padding each row]

Set sequence over time without padding: 16 32 48 16 32 48

Set sequence over time after padding: 17 34 51 4 21 48 …

28
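The set mapping in the figure can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: a 64-byte cache line, 64 sets, 8-byte `double` elements, an array base at a line-aligned address 0, and 8 elements of padding per row. None of these values appear on the slide, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache geometry: 64-byte lines, 64 sets. */
enum { LINE_BYTES = 64, NUM_SETS = 64 };

/* Byte offset of an address within its cache line. */
static unsigned line_offset(uintptr_t addr) {
    return (unsigned)(addr % LINE_BYTES);
}

/* Cache set selected by an address. */
static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

/* Tag: the address bits above the set index. */
static uintptr_t tag(uintptr_t addr) {
    return addr / LINE_BYTES / NUM_SETS;
}

/* Set touched by element [row][0] of a row-major array of doubles whose
   rows hold `cols` elements plus `pad` unused elements. The array base is
   assumed to sit at address 0 to keep the arithmetic visible. */
static unsigned set_of_row(size_t row, size_t cols, size_t pad) {
    assert(cols > 0);
    uintptr_t addr = (uintptr_t)(row * (cols + pad) * sizeof(double));
    return set_index(addr);
}
```

Under these assumptions, `set_of_row(r, 128, 0)` for rows 0 to 4 gives sets 0, 16, 32, 48, 0, so a walk down one column keeps returning to the same four sets. With `pad = 8` the same rows land in sets 0, 17, 34, 51, 4, which spreads the misses out.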

slide-108
SLIDE 108

Observation: temporal pattern of conflict (cont.)

Conflict: 16 32 48 16 32 48 … (Distance = 3, Distance = 3)
No conflict: 17 34 51 4 21 48 … (Distance = 63)

slide-111
SLIDE 111

Re-conflict Distance (RCD)

  • Number of cache misses in other cache sets between two consecutive misses in one particular set

Set misses over time: S1 S2 S0 S1 S3 S2 S1 S0 (for S1: RCD=2, RCD=2)

PMU samples over time: S1 S1 S2 S3 (for S1: RCD=0)

Approximate RCD
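The definition above translates directly into a single pass over the sequence of missed sets. The following is a minimal sketch, not CCProf's implementation; the function name, the `MAX_SETS` bound, and the use of -1 for "no earlier miss in this set" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SETS = 64 };   /* assumed number of cache sets */

/* For each miss in sets[0..n), store in rcd[i] the number of misses to
   other sets since the previous miss to the same set, or -1 if that set
   has not missed before. */
static void reconflict_distances(const unsigned *sets, size_t n, long *rcd) {
    long last_seen[MAX_SETS];
    for (size_t s = 0; s < MAX_SETS; s++)
        last_seen[s] = -1;

    for (size_t i = 0; i < n; i++) {
        assert(sets[i] < MAX_SETS);
        long prev = last_seen[sets[i]];
        /* Everything strictly between prev and i missed in another set. */
        rcd[i] = (prev < 0) ? -1 : (long)i - prev - 1;
        last_seen[sets[i]] = (long)i;
    }
}
```

Fed the full miss sequence S1 S2 S0 S1 S3 S2 S1 S0, the two repeat misses to S1 both get RCD 2. Fed only the sampled sequence S1 S1 S2 S3, the second S1 gets RCD 0, which is why RCD computed from samples is only an approximation.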

slide-113
SLIDE 113

RCD and its contribution

RCD      Count    Is conflict?
Short    Large    Yes
Short    Small    No
Long     ~        No

Regression model
  • Training: benchmark RCD → model
  • Prediction: application RCD → model → Conflict / No-Conflict
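The table's rule (many short RCDs indicate a conflict) can be approximated with a simple threshold test. The sketch below is a hypothetical stand-in for the trained regression model, not the model itself: both the `short_rcd` cutoff and the `min_share` threshold are assumed parameters that a real tool would learn from benchmark data. RCD values of -1 mark misses with no earlier miss in the same set.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of misses whose RCD is defined (>= 0) and below short_rcd. */
static double short_rcd_share(const long *rcd, size_t n, long short_rcd) {
    assert(short_rcd > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (rcd[i] >= 0 && rcd[i] < short_rcd)
            count++;
    return n ? (double)count / (double)n : 0.0;
}

/* Hypothetical replacement for the trained model: report a conflict when
   short-RCD misses make up at least min_share of all misses. */
static int looks_like_conflict(const long *rcd, size_t n,
                               long short_rcd, double min_share) {
    return short_rcd_share(rcd, n, short_rcd) >= min_share;
}
```

A loop whose misses mostly have RCD 2 would be flagged, while one whose repeat misses are 63 apart would not, matching the first and last rows of the table.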

slide-114
SLIDE 114

Case Study: PolyBench/C - ADI

CCPROF PREDICTS >>> *** CONFLICT MISS *** in LOOP(line: 102). Loop contribution is *** HIGH *** 94.26%
CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 108). Loop's contribution to total L1 miss: 3.13%
CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 117). Loop's contribution to total L1 miss: 0.86%
CCPROF PREDICTS >>> *** NO CONFLICT MISS *** in loop(line: 122). Loop's contribution to total L1 miss: 1.74%

slide-117
SLIDE 117

RCD – before and after optimization

Speedup per panel: 3×, 1.26×, 1.13×, 1.09×*, 94.6×*, 1.12×
*Loop-level speedup

Median overhead: 37%
Compare with simulation: 38×

slide-118
SLIDE 118

Outline

✓ Lightweight profiling
✓ SMT-aware optimization
✓ Detection of cache conflicts
  • Guiding data-structure layout transformation

slide-119
SLIDE 119

StructSlim: A lightweight profiler to guide structure splitting

Probir Roy, Xu Liu [CGO – 2016]

LWPTool: A Lightweight Profiler to Guide Data Layout Optimization

Chao Yu, Probir Roy, Yuebin Bai, Hailong Yang, Xu Liu [TPDS – 2018]

slide-124
SLIDE 124

Inefficient data-structure

struct type {int a; int b; int c; int d;};
struct type Arr[N];
for (i = 0; i < N; i++)
    B[i] = Arr[i].a + Arr[i].c;

Cache line in L1 cache: a b c d a b c d
Utilization = 50%

Split structure

struct type_part1 {int a; int c;};
struct type_part2 {int b; int d;};

Cache line in L1 cache: a c a c a c a c
Utilization = 100%
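The effect of the split can be checked with a small self-contained version of the slide's loop. This is an illustrative sketch: the struct definitions come from the slide, while the function names and the `utilization` helper are hypothetical additions.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout: the loop reads a and c, but every element also drags
   b and d into the cache. */
struct type { int a; int b; int c; int d; };

/* Split layout: fields used together stay together. */
struct type_part1 { int a; int c; };
struct type_part2 { int b; int d; };

/* B[i] = Arr[i].a + Arr[i].c over the original array of structures. */
static void sum_original(const struct type *arr, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = arr[i].a + arr[i].c;
}

/* Same computation over the hot half of the split structure. */
static void sum_split(const struct type_part1 *hot, int *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = hot[i].a + hot[i].c;
}

/* Share of each fetched element that the loop actually reads. */
static double utilization(size_t bytes_read, size_t bytes_per_element) {
    assert(bytes_per_element > 0);
    return (double)bytes_read / (double)bytes_per_element;
}
```

Both loops produce the same output, but `utilization(2 * sizeof(int), sizeof(struct type))` is 0.5 while the same call with `sizeof(struct type_part1)` is 1.0, which is the 50% versus 100% on the slide.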

slide-131
SLIDE 131

Structure splitting: questions to ask

How to split structure?
  • Which data structures are significant? High usage
  • Which fields to keep together? Loop-level analysis, field affinity

Fields: Field 1, Field 2, Field 3
Loops: Loop 1, Loop 2
Keep together: Field 1 + Field 2, or Field 2 + Field 3 (90% vs. 10%)
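One way to turn loop-level analysis into a field-affinity score is sketched below. The metric is an assumption chosen for illustration (latency of loops touching both fields divided by latency of loops touching either field) and is not necessarily the formula StructSlim uses; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled loop: its share of total latency and a bitmask of the
   structure fields it accesses (bit k set means field k is accessed). */
struct loop_profile { double latency_share; unsigned fields; };

/* Assumed affinity metric: latency of loops touching both fields divided
   by latency of loops touching either field. */
static double affinity(const struct loop_profile *loops, size_t n,
                       unsigned f1, unsigned f2) {
    assert(f1 < 32 && f2 < 32);
    unsigned m1 = 1u << f1, m2 = 1u << f2;
    double both = 0.0, either = 0.0;
    for (size_t i = 0; i < n; i++) {
        int a = (loops[i].fields & m1) != 0;
        int b = (loops[i].fields & m2) != 0;
        if (a && b) both += loops[i].latency_share;
        if (a || b) either += loops[i].latency_share;
    }
    return either > 0.0 ? both / either : 0.0;
}
```

If Loop 1 is assumed to access fields 0 and 1 with 90% of the latency, and Loop 2 fields 1 and 2 with 10%, this metric scores the pair (0, 1) near 0.9 and the pair (1, 2) near 0.1, so the first pair would be kept together.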

slide-138
SLIDE 138

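Code-centric and data-centric attribution

  • The PMU takes samples (S) across application time; each sample carries the reference type, data address, and instruction pointer
  • Code-centric: samples are attributed to Loop 1 and Loop 2
  • Data-centric: a memory allocation monitor tracks heap objects so that samples can be attributed to them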
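Data-centric attribution amounts to looking up which recorded allocation contains a sampled data address. The sketch below shows that lookup with a linear scan; the record layout, names, and the idea that the allocation monitor hands over a flat array are all assumptions for illustration, and a real profiler would likely use a faster index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One heap object as recorded by a (hypothetical) allocation monitor. */
struct alloc_record { uintptr_t base; size_t size; const char *name; };

/* One PMU sample: the data address touched and the instruction pointer. */
struct pmu_sample { uintptr_t data_addr; uintptr_t ip; };

/* Data-centric attribution: return the allocation containing the sampled
   address, or NULL if it lies outside every recorded object. */
static const struct alloc_record *owner_of(const struct alloc_record *allocs,
                                           size_t n,
                                           const struct pmu_sample *s) {
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s->data_addr >= allocs[i].base &&
            s->data_addr - allocs[i].base < allocs[i].size)
            return &allocs[i];
    }
    return NULL;
}
```

The instruction pointer in the same sample would be resolved separately to a loop, which gives the code-centric half of the attribution.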

slide-139
SLIDE 139

Code-centric and data-centric attribution

[Diagram: PMU samples (S = sample) attributed to Loop 1 and Loop 2, and to Struct 1 and Struct 2 in the heap]

slide-143
SLIDE 143

Code-centric and data-centric attribution

[Diagram: PMU samples (S = sample) landing in heap object Struct 1. Distances between sampled addresses yield field offsets, which identify Field 1, Field 2, and Field 3; the fields are then related to Loop 1 and Loop 2 (90%, 10%)]
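One plausible way to get from distances to field offsets is sketched below. It assumes the sampled addresses come from a single instruction walking an array of structures, so every distance is a multiple of the element size, and it takes the GCD of those distances as the size estimate. This is an illustration under those assumptions, with hypothetical function names; the profiler's actual analysis may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor; gcd(0, d) is d. */
static uintptr_t gcd(uintptr_t a, uintptr_t b) {
    while (b != 0) {
        uintptr_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Estimate the element size as the GCD of the distances between
   consecutive sampled addresses. Returns 0 if fewer than two distinct
   addresses were sampled. */
static uintptr_t estimate_stride(const uintptr_t *addrs, size_t n) {
    uintptr_t g = 0;
    for (size_t i = 1; i < n; i++) {
        uintptr_t d = addrs[i] > addrs[i - 1] ? addrs[i] - addrs[i - 1]
                                              : addrs[i - 1] - addrs[i];
        g = gcd(g, d);
    }
    return g;
}

/* Offset of the accessed field inside its element, given the object's
   base address and the estimated element size. */
static uintptr_t field_offset(uintptr_t addr, uintptr_t base, uintptr_t stride) {
    assert(stride > 0 && addr >= base);
    return (addr - base) % stride;
}
```

With sparse samples the GCD can overestimate the element size (it returns a multiple of the true size), so the estimate improves as more samples arrive.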

slide-145
SLIDE 145

Case study: SPEC CPU 2000 ART

Loops with line numbers    Latency percentage    Accessed fields
131-138                    1.59%                 U, P
559-570                    8.42%                 X, Q
553-554                    1.98%                 W
545-548                    10.83%                U, I
615-616                    56.57%                P
607-608                    14.40%                P
589-592                    2.25%                 U, P
575-576                    3.72%                 V
1015-1016                  0.24%                 I

Original structure:

typedef struct {
    double *I;
    double W;
    double X;
    double V;
    double U;
    double P;
    double Q;
    double R;
} f1_neuron;

Affinity graph: nodes U, I, P, Q, X, R, W, V; edge weights 86%, 5%, 100%

Split structures:

typedef struct { double *I; double U; } f1_neuron_IU;
typedef struct { double Q; double X; } f1_neuron_QX;
typedef struct { double P; } f1_neuron_P;
typedef struct { double V; } f1_neuron_V;
typedef struct { double W; } f1_neuron_W;
typedef struct { double R; } f1_neuron_R;

slide-146
SLIDE 146

Benchmarks: speedup, overhead, cache miss

Benchmark         Speedup    Runtime overhead    L1 miss reduction    L2 miss reduction
179.ART           1.37×      2.05%               46.5%                51.1%
462.Libquantum    1.09×      2.79%               49%                  82.6%
TSP               1.09×      2.42%               13.3%                19.9%
Mser              1.03×      2.95%               8.3%                 8.4%
CLOMP 1.2         1.25×      16.1%               15.5%                26.4%
Health            1.12×      18.3%               66.7%                90.8%
NN                1.33×      5.21%               87.2%                98.0%
Average           1.18×      7.1%

Compiler: gcc -O3

Related work: Yan, Jianian, Jiangzhou He, Wenguang Chen, Pen-Chung Yew, and Weimin Zheng. "ASLOP: A field-access affinity-based structure data layout optimizer." Overhead: average 4×

slide-147
SLIDE 147

Conclusions

  • Simulation methods (PinTool, GPGPUSim, GEMS): deep insight, high overhead
  • Measurement methods (Perf, Oprofile, PAPI): low overhead, shallow insight
  • SMTAnalyzer, StructSlim, CCProf: low overhead, deep insight

Lightweight profiling with PMUs can provide deep insights into performance issues caused by memory hierarchies and poor algorithm choices.

slide-148
SLIDE 148

Publications

  • [CGO'18] "Lightweight Detection of Cache Conflicts", Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy and Xu Liu, The 2018 International Symposium on Code Generation and Optimization, Feb 24 - 28, 2018, Vienna, Austria. Acceptance ratio: 28%.
  • [TACO'18] "NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks", Probir Roy, Shuaiwen Leon Song, Sriram Krishnamoorthy, Abhinav Vishnu, Dipanjan Sengupta, Xu Liu, ACM Transactions on Architecture and Code Optimization, 2018.
  • [TPDS'18] "LWPTool: A Lightweight Profiler to Guide Data Layout Optimization", Chao Yu, Probir Roy, Yuebin Bai, Hailong Yang, Xu Liu, IEEE Transactions on Parallel and Distributed Systems, 2018.
  • [HPDC'16] "SMT-Aware Instantaneous Footprint Optimization", Probir Roy, Xu Liu and Shuaiwen Leon Song, The 25th ACM International Symposium on High-Performance and Distributed Computing, May 31 - Jun 4, 2016, Kyoto, Japan. Acceptance ratio: 15.5% (20/129).
  • [CGO'16] "StructSlim: A Lightweight Profiler to Guide Structure Splitting", Probir Roy and Xu Liu, The 2016 International Symposium on Code Generation and Optimization, Mar 12-18, 2016, Barcelona, Spain. Acceptance ratio: 23%.

slide-150
SLIDE 150

Challenges ahead

  • Program analysis for declarative programming languages
      • Domain-specific languages provide high-level abstraction
      • Machine learning (PyTorch), HPC (HDF5), big data (SQL)
  • Analyzing and optimizing data center and cloud applications
      • Resource utilization/scheduling in multi-tenant environments
      • Heterogeneous architecture resource management
  • Security analysis
      • Program analysis to identify vulnerable source code
  • Analysis of emerging hardware
      • GPU, FPGA, Tensor processing units