Automatic Identification and Precise Attribution of DRAM Bandwidth Contention


SLIDE 1

Automatic Identification and Precise Attribution of DRAM Bandwidth Contention

Christian Helm and Kenjiro Taura
The University of Tokyo
SLIDE 2

Performance Optimization

  • Applications rarely reach peak performance of hardware
  • Memory bandwidth is a bottleneck for many applications
  • The problems originate from the interaction of software and hardware

SLIDE 3

DRAM Contention

  • DRAM contention

– More demand than resources available
– Multiple cores compete over the single bandwidth resource

  • Consumed bandwidth cannot identify contention
  • An example: an application uses 95% of the available bandwidth

☺ Good (No contention)
– The application gets everything it needs
– All of the available resources are used
☹ Bad (Contention)
– The application requests more than the DRAM can provide

SLIDE 4

NUMA Systems

  • Higher aggregate bandwidth
  • Requires using all DRAMs

[Diagram: four NUMA nodes (0 to 3), each with four cores and its own DRAM.]

SLIDE 5

Contributions

  • A method to identify DRAM contention and imbalanced NUMA resource usage
– Differentiation from harmless high bandwidth
– Precise attribution to instructions and objects
– Contention severity metric
– NUMA imbalance severity metric
– Practical

  • Lightweight profiler
  • Single socket and NUMA systems supported
  • Runs on a default Linux OS
  • Works on unmodified code with debug information
SLIDE 6

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 7

Related Work - Contention

  • Measurement of consumed bandwidth [Intel PCM] [Intel VTune] [Weyers et al. VPA 2014]
– Does not identify contention
– No precise attribution
  • Latency as an indicator for memory problems [Lachaize et al. USENIX ATC 2012] [Liu et al. SC 2013] [Liu et al. PPoPP 2014] [Liu et al. SC 2015]
– Does not identify contention
– Precise attribution through instruction sampling
  • Performance counter based detection [Yasin ISPASS 2014] [Molka et al. ICPE 2017] [Eklov et al. CGO 2013] [Eyerman ISPASS 2012]
– Performance counters that identify bandwidth boundedness and exclude other problems
– No precise attribution
  • Instruction sampling based [Xu et al. IPDPS 2017]
– Machine learning approach based on latency and other features
– Only NUMA remote memory; no local memory contention, no single-socket systems
– Severity cannot be quantified

SLIDE 8

Related Work - NUMA

  • Show the location of allocation, first touch, and use of data [Liu et al. PPoPP 2014]
– No quantification of imbalance
  • Visual detection of imbalance [Gimenez et al. SC 2014] [Gimenez et al. TVCG 2017] [Trahay et al. ICPP 2018]
  • OS extension with imbalance metric [Fedorova et al. ASPLOS 2013]
– Standard deviation of load across nodes

SLIDE 9

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 10

Relation of Bandwidth And Latency

  • Known in queuing theory
  • Application to DRAM

[Diagram: queuing model of DRAM access, with an analogy to customers arriving at a store. When arrivals are frequent, latency is the queuing delay plus the DRAM processing time (the queuing delay plus the in-store processing time). When arrivals are sparse, there is no queuing delay, and latency is only the DRAM processing time (only the in-store processing time).]
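The queuing effect can be made concrete with a toy simulation (a minimal sketch, not from the talk; the arrival and service rates are illustrative assumptions): as the offered load approaches the service capacity, the time a request spends in the system grows sharply, which is why latency, unlike consumed bandwidth, exposes contention.

    // Toy M/M/1 queue: one server ("DRAM") with exponential
    // interarrival and service times. Latency = queuing delay +
    // processing time, exactly as in the diagram above.
    #include <algorithm>
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(42);
        const double service_rate = 1.0; // requests served per time unit
        for (double util : {0.2, 0.5, 0.8, 0.95}) { // load relative to capacity
            std::exponential_distribution<double> interarrival(util * service_rate);
            std::exponential_distribution<double> service(service_rate);
            double clock = 0.0, server_free = 0.0, total_latency = 0.0;
            const int n = 200000;
            for (int i = 0; i < n; ++i) {
                clock += interarrival(rng);                  // next request arrives
                double start = std::max(clock, server_free); // wait if server busy
                server_free = start + service(rng);          // processing time
                total_latency += server_free - clock;        // delay + processing
            }
            std::printf("utilization %.2f -> mean latency %.2f\n",
                        util, total_latency / n);
        }
    }

With a mean service time of 1, the simulated latency stays near 1 at 20% load but grows to roughly 20 at 95% load, mirroring the earlier point that using 95% of the bandwidth can be either harmless or a severe contention problem.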

SLIDE 11

Latency of an Application

ID  IP  Data Address  Latency  Memory Level  TLB   Locked
–   a   aa            50       L2            hit   No
1   b   bb            330      DRAM          miss  No
2   c   cc            600      DRAM          hit   Yes
3   d   dd            300      DRAM          hit   No
4   e   ee            290      DRAM          hit   No

  • Application latency measurement with instruction sampling
  • Selected samples (filtered as sketched below):
– Exclude cache hits
– Exclude TLB misses
– Exclude atomic accesses
– Average latency of at least 25 samples
– Precise attribution

  • C. Helm and K. Taura, On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency, HPC Asia 2020
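The selection step maps directly to a small filter over the sampled records. A minimal sketch under assumed field names (this is not the PerfMemPlus source):

    // Filtering latency samples as the slide describes. The record
    // layout is an illustrative assumption.
    #include <cstdio>
    #include <optional>
    #include <vector>

    struct Sample {
        double latency; // cycles from issue to data return
        bool from_dram; // memory level resolved to DRAM (not a cache hit)
        bool tlb_miss;  // access also missed the TLB
        bool locked;    // atomic / locked access
    };

    // Average DRAM latency, or nothing if fewer than 25 samples remain.
    std::optional<double> dram_latency(const std::vector<Sample>& samples) {
        double sum = 0.0;
        int n = 0;
        for (const Sample& s : samples) {
            if (!s.from_dram) continue; // exclude cache hits
            if (s.tlb_miss) continue;   // page walk inflates latency
            if (s.locked) continue;     // exclude atomic accesses
            sum += s.latency;
            ++n;
        }
        if (n < 25) return std::nullopt; // too few samples for a stable average
        return sum / n;
    }

    int main() {
        // The five samples from the table above; only d and e survive the
        // filters, so the 25-sample threshold is not met here.
        std::vector<Sample> s = {{50, false, false, false}, {330, true, true, false},
                                 {600, true, false, true},  {300, true, false, false},
                                 {290, true, false, false}};
        if (auto l = dram_latency(s)) std::printf("avg DRAM latency: %.1f\n", *l);
        else std::puts("fewer than 25 usable samples");
    }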

SLIDE 12

Relative Latency Metric

  • System latency
– Latency of an uncontended DRAM access
– Determined with a pointer-chasing benchmark (see the sketch below)
  • Relative latency = application latency / system latency
  • Hardware-independent measure of the severity of DRAM contention
  • A value higher than one indicates a contention problem
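A minimal pointer-chasing sketch (an assumed setup; the talk's own benchmark may differ in details): a random cycle is walked so that each load depends on the previous one, defeating prefetching and request overlap, so the time per step approximates the uncontended DRAM latency.

    // Pointer chasing over a random cycle much larger than the L3 cache.
    // Buffer size and step count are illustrative assumptions.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = 1ull << 25; // ~256 MB of size_t cells
        std::vector<size_t> next(n), order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937_64(1));
        for (size_t i = 0; i < n; ++i) // link cells into one random cycle
            next[order[i]] = order[(i + 1) % n];

        const size_t steps = 10'000'000;
        size_t p = order[0];
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < steps; ++i) p = next[p]; // serialized loads
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        // Printing p keeps the chase from being optimized away.
        std::printf("uncontended latency: ~%.1f ns per access (chk %zu)\n", ns, p);
    }

Dividing the sampled application latency by this baseline gives the relative latency; a value of 2 means DRAM accesses take twice as long as on an otherwise idle memory system.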
SLIDE 13

NUMA Imbalance Metric

  • Local ratio (per node) = number of local accesses / number of total DRAM accesses
  • NUMA imbalance = max(local ratio) - min(local ratio), computed as in the sketch below
  • 1 → high imbalance
  • 0 → low imbalance

[Table: example samples with columns ID, CPU ID (the node the access originated from), and Memory Level (local or remote DRAM access).]
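A minimal sketch of the metric computation (the field names and sample values are illustrative assumptions):

    // Per-node local ratio and the NUMA imbalance metric from memory samples.
    #include <algorithm>
    #include <cstdio>
    #include <map>

    struct MemSample { int node; bool local; }; // origin node, local/remote DRAM

    int main() {
        MemSample samples[] = {{0, true}, {0, true}, {0, true}, {0, true},
                               {1, true}, {1, false}, {1, false}, {1, false}};
        std::map<int, std::pair<int, int>> perNode; // node -> {local, total}
        for (const MemSample& s : samples) {
            auto& [local, total] = perNode[s.node];
            if (s.local) ++local;
            ++total;
        }
        double lo = 1.0, hi = 0.0;
        for (const auto& [node, counts] : perNode) {
            double ratio = double(counts.first) / counts.second; // local ratio
            std::printf("node %d: local ratio %.2f\n", node, ratio);
            lo = std::min(lo, ratio);
            hi = std::max(hi, ratio);
        }
        // 0 -> every node equally local; 1 -> highly imbalanced
        std::printf("NUMA imbalance: %.2f\n", hi - lo);
    }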

SLIDE 14

Profiling Tool Implementation

PerfMemPlus is available online: https://github.com/helchr/PerfMemPlus

[Diagram: tool components. Run Script, Linux Perf, Allocation Tracker, Profiled Application, Data Merger, SQLite Database, Analyzer.]
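Since the merged profile ends up in an SQLite database, analyses can be expressed as plain SQL. A sketch of such a query from C++; the table and column names here are hypothetical, not the actual PerfMemPlus schema:

    // Pull averaged DRAM latencies per instruction from the profile database.
    // The "samples" schema is a hypothetical stand-in.
    #include <cstdio>
    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("profile.db", &db) != SQLITE_OK) return 1;
        const char* q =
            "SELECT ip, AVG(latency), COUNT(*) FROM samples "
            "WHERE memory_level = 'DRAM' "
            "GROUP BY ip HAVING COUNT(*) >= 25 ORDER BY AVG(latency) DESC";
        sqlite3_stmt* stmt = nullptr;
        if (sqlite3_prepare_v2(db, q, -1, &stmt, nullptr) == SQLITE_OK) {
            while (sqlite3_step(stmt) == SQLITE_ROW)
                std::printf("ip=0x%llx avg latency=%.1f (%d samples)\n",
                            (unsigned long long)sqlite3_column_int64(stmt, 0),
                            sqlite3_column_double(stmt, 1),
                            sqlite3_column_int(stmt, 2));
            sqlite3_finalize(stmt);
        }
        sqlite3_close(db);
    }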

SLIDE 15

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 16

Hardware Setup

Name      Architecture  CPUs           DRAM Bandwidth
Arcturus  Broadwell     2x E5-2699v4   43 GB/s
Comet     Haswell       2x E5-2699v3   32 GB/s
Rigel     Skylake       2x Xeon 8176   77 GB/s
Spica     Broadwell     4x E7-8890v4   25 GB/s

SLIDE 17

Experiment Design

  • A defined amount of contention to compare with the detection results
– A benchmark that creates an adjustable amount of contention
– A quantification of the severity of contention

SLIDE 18

Adjustable Contention Benchmarks

  • Simple memory-intensive parallel vector operations [Xu et al. IPDPS 2017]
– Countv, Dotv, Sumv
  • Two data sizes
– Smaller than the L3 cache: the optimal case, where DRAM bandwidth is no limitation
– Larger than the L3 cache: the DRAM bandwidth limitation has an impact
  • Variable number of threads
– More threads increase the bandwidth requirement (see the sketch after this list)
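A minimal OpenMP sketch in the spirit of these benchmarks (array sizes and the setup are illustrative assumptions, not the original code):

    // Adjustable contention benchmark in the style of sumv: a parallel
    // vector sum whose array size and thread count are varied.
    #include <cstdio>
    #include <cstdlib>
    #include <omp.h>
    #include <vector>

    int main(int argc, char** argv) {
        // e.g. 1<<16 floats stays in cache; 1<<28 floats exceeds any L3
        size_t n = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : (1ull << 28);
        int threads = argc > 2 ? std::atoi(argv[2]) : omp_get_max_threads();
        std::vector<float> v(n, 1.0f);

        double t0 = omp_get_wtime();
        double sum = 0.0;
        // Every extra thread adds another memory stream, raising demand.
        #pragma omp parallel for reduction(+ : sum) num_threads(threads)
        for (size_t i = 0; i < n; ++i) sum += v[i];
        double t = omp_get_wtime() - t0;

        std::printf("sum=%.0f, %d threads, %.2f GB/s\n",
                    sum, threads, n * sizeof(float) / t / 1e9);
    }

Compiled with g++ -O2 -fopenmp, the small size keeps the working set in cache as threads grow, while the large size makes each added thread raise the DRAM bandwidth demand.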

SLIDE 19

Contention Quantifjcation

  • Speedup loss = speedup(small array version) / speedup(large array version)
  • Expresses the severity of the DRAM contention
  • For example, if the cache-resident version scales to 16x on 16 threads but the large version only reaches 4x, the speedup loss is 4
SLIDE 20

Detection Results

  • Each experiment is repeated 10 times
  • The percentage of correct detection is recorded

[Chart: percentage of correct detections, grouped by the upper boundary of the speedup loss interval.]

SLIDE 21

Advantage of Latency over Bandwidth

  • Compare information from three sources

– Bandwidth
– Latency
– Speedup loss

SLIDE 22

Information From Bandwidth

All benchmarks suffer from limited DRAM bandwidth

SLIDE 23

Information From Latency

DRAM access latency in uncontended state

SLIDE 24

Information From Latency

DRAM contention problem differs between benchmarks

SLIDE 25

True Information

DRAM contention problem differs between benchmarks

SLIDE 26

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 27

Applications

  • All 13 PARSEC benchmarks
  • N3LP

– Neural machine translation [Eriguchi et al. WAT 2016] [https://github.com/hassyGo/N3LP]
– Implemented using the Eigen library [http://eigen.tuxfamily.org]

SLIDE 28

Bandwidth Contention Details

[Chart: relative latency (1 to 5) of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica, with the contention threshold marked.]

SLIDE 29

NUMA Imbalance

[Chart: NUMA imbalance (0.2 to 1) of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica; annotations separate a small NUMA imbalance problem from a clear NUMA imbalance problem.]

SLIDE 30

Interleaved Allocation Speedup

[Chart: speedup in % (10 to 70) from interleaved allocation for streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica.]

  • High relative latency, high NUMA imbalance, interleaved allocation
➔ Large speedup
  • High relative latency only on Spica, high NUMA imbalance on all systems
➔ Speedup only on Spica
  • High relative latency, low NUMA imbalance
➔ No or low speedup

SLIDE 31

Profiling Overhead (PARSEC)

SLIDE 32

Conclusion

A new method to identify DRAM contention

Relative Latency  NUMA Imbalance  Performance Problem
Low               Any             No DRAM contention problem
High              Low             Contention, but not NUMA related
High              High            Contention due to inefficient NUMA usage
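The table reads as a two-step decision. A minimal sketch in code (relative latency above one is the talk's contention indicator; the imbalance threshold is an illustrative assumption):

    // The conclusion's decision table as a tiny classifier.
    #include <cstdio>
    #include <string>

    std::string diagnose(double relativeLatency, double numaImbalance) {
        const double latencyThreshold = 1.0;   // above uncontended system latency
        const double imbalanceThreshold = 0.5; // assumed low/high cut
        if (relativeLatency <= latencyThreshold) return "No DRAM contention problem";
        if (numaImbalance <= imbalanceThreshold) return "Contention, but not NUMA related";
        return "Contention due to inefficient NUMA usage";
    }

    int main() {
        std::printf("%s\n", diagnose(0.9, 0.8).c_str()); // low latency, any imbalance
        std::printf("%s\n", diagnose(2.5, 0.1).c_str()); // contention, balanced NUMA
        std::printf("%s\n", diagnose(3.0, 0.9).c_str()); // contention from NUMA usage
    }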

  • Future work
– Include more hardware-related causes of DRAM contention