Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Christian Helm and Kenjiro Taura
The University of Tokyo
2017-01-12

Performance Optimization
– Applications rarely reach peak performance of hardware
– More demand than resources available
– Multiple cores compete over the single bandwidth resource
– The application gets everything it needs
– All of the available resources are used
– The application requests more than the DRAM can provide
[Figure: system with four nodes (Node 0 to Node 3), each with four cores and its own DRAM]
– Differentiation from harmless high bandwidth
– Precise attribution to instructions and objects
– Contention severity metric
– NUMA imbalance severity metric
– Practical
– Does not identify contention
– No precise attribution
[Liu et al. SC 2013] [Liu et al. PPoPP 2014] [Liu et al. SC 15]
– Does not identify contention
– Precise attribution through instruction sampling
[Eklov et al. CGO 2013] [Eyerman ISPASS 2012]
– Performance counters that identify bandwidth boundness and exclude other problems
– No precise attribution
– Machine learning approach based on latency and other features
– Only NUMA remote memory, no local memory contention, no single-socket systems
– Severity cannot be quantified
– No quantification of imbalance
– Standard deviation of load across nodes
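
The standard-deviation approach to imbalance named above can be sketched as follows. This is a minimal illustration only: it assumes the load of a node is its count of DRAM accesses, and both the function name `numa_imbalance` and the division by the mean are assumptions made here, not details of any cited tool.

```python
from statistics import mean, pstdev


def numa_imbalance(accesses_per_node):
    """Population standard deviation of per-node load, divided by the mean load.

    Returns 0.0 when every node carries the same load. Dividing by the mean is
    an assumption of this sketch, so the value does not scale with sample count.
    """
    avg = mean(accesses_per_node)
    if avg == 0:
        return 0.0
    return pstdev(accesses_per_node) / avg


# Hypothetical access counts for a four-node machine.
balanced = numa_imbalance([100, 100, 100, 100])  # all nodes equally loaded
skewed = numa_imbalance([400, 0, 0, 0])          # all accesses hit one node
```

A larger value means the accesses are concentrated on fewer nodes; with this normalization, all accesses on one of four nodes gives the square root of 3.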
[Figure: two diagrams, each showing arrivals from the Application at the DRAM]
ID  IP  Data Address  Latency  Memory Level  TLB   Locked
    a   aa            50       L2            hit   No
1   b   bb            330      DRAM          miss  No
2   c   cc            600      DRAM          hit   Yes
3   d   dd            300      DRAM          hit   No
4   e   ee            290      DRAM          hit   No
On The Correct Measurement of Application Memory Bandwidth and Memory Access Latency, HPC Asia 2020
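
How sample records like those listed above might be reduced to a DRAM latency figure can be sketched as follows. This is a hedged illustration, not the presented tool's implementation: the `Sample` type, its field names, the latency unit, and the choice to drop samples with a TLB miss or a locked access (on the assumption that their latency is inflated by effects other than DRAM bandwidth) are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    ip: str
    data_address: str
    latency: int       # unit as reported by the sampling hardware (assumed: cycles)
    memory_level: str
    tlb_hit: bool
    locked: bool


# The five example rows from the sample table.
samples = [
    Sample("a", "aa", 50, "L2", True, False),
    Sample("b", "bb", 330, "DRAM", False, False),
    Sample("c", "cc", 600, "DRAM", True, True),
    Sample("d", "dd", 300, "DRAM", True, False),
    Sample("e", "ee", 290, "DRAM", True, False),
]


def dram_latencies(rows):
    """Keep DRAM samples that had a TLB hit and were not locked accesses."""
    return [s.latency for s in rows
            if s.memory_level == "DRAM" and s.tlb_hit and not s.locked]


kept = dram_latencies(samples)   # latencies of rows d and e
avg_dram_latency = mean(kept)
```

Under these assumptions only rows d and e contribute, so the cache hit and the two outlier-prone samples do not distort the average.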
Columns: ID, CPU ID, Memory Level (some cell values were lost in extraction)
Local DRAM
1 Remote DRAM
3 1 Local DRAM
4 1 Local DRAM
[Figure: tool architecture with the components Run Script, Linux Perf, Allocation Tracker, Profiled Application, Data Merger, SQLite Database, and Analyzer]
– Benchmark to create an adjustable amount of contention
– A quantification of the severity of contention
– Countv
– Dotv
– Sumv
Two data sizes
– Smaller than L3 cache: bandwidth is no limitation
– Larger than L3 cache: bandwidth has an impact
– More threads will increase the bandwidth requirement
Speedup (Large array version)
– Neural machine translation
– Implemented using Eigen library [http://eigen.tuxfamily.org]
[Figure: Relative Latency per Benchmark (streamcluster, canneal, n3lp). Legend: arcturus, comet, rigel, spica, contention]
[Figure: NUMA Imbalance per benchmark (streamcluster, canneal, n3lp). Legend: arcturus, comet, rigel, spica]
[Figure: Speedup % per benchmark (streamcluster, canneal, n3lp). Legend: arcturus, comet, rigel, spica]
➔ Large Speedup
➔ Speedup only on Spica
➔ No or low speedup
– Include more hardware-related causes of DRAM contention