Automatic Identification and Precise Attribution of DRAM Bandwidth Contention


SLIDE 1

Automatic Identification and Precise Attribution of DRAM Bandwidth Contention

Christian Helm and Kenjiro Taura
The University of Tokyo
SLIDE 2

Performance Optimization

  • Applications rarely reach peak performance of hardware
  • Memory bandwidth is a bottleneck for many applications
  • The problems originate from the interaction of software and hardware

SLIDE 3

DRAM Contention

  • DRAM contention

– More demand than resources available
– Multiple cores compete over the single bandwidth resource

  • Consumed bandwidth cannot identify contention
  • An example: an application uses 95% of the available bandwidth

☺ Good (No contention)
– The application gets everything it needs
– All of the available resources are used
☹ Bad (Contention)
– The application requests more than the DRAM can provide

SLIDE 4

NUMA Systems

  • Higher aggregate bandwidth
  • Requires using all DRAMs

[Diagram: four NUMA nodes (0 to 3), each with four cores and its own DRAM.]

SLIDE 5

Contributions

  • A method to identify DRAM contention and imbalanced NUMA resource usage
– Differentiation from harmless high bandwidth
– Precise attribution to instructions and objects
– Contention severity metric
– NUMA imbalance severity metric
– Practical

  • Lightweight profiler
  • Single socket and NUMA systems supported
  • Runs on a default Linux OS
  • Works on unmodified code with debug information
SLIDE 6

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 7

Related Work - Contention

  • Measurement of consumed bandwidth [Intel PCM] [Intel VTune] [Weyers et al. VPA 2014]
– Does not identify contention
– No precise attribution
  • Latency as an indicator for memory problems [Lachaize et al. USENIX ATC 2012] [Liu et al. SC 2013] [Liu et al. PPoPP 2014] [Liu et al. SC 2015]
– Does not identify contention
– Precise attribution through instruction sampling
  • Performance counter based detection [Yasin ISPASS 2014] [Molka et al. ICPE 2017] [Eklov et al. CGO 2013] [Eyerman ISPASS 2012]
– Performance counters that identify bandwidth boundedness and exclude other problems
– No precise attribution
  • Instruction sampling based [Xu et al. IPDPS 2017]
– Machine learning approach based on latency and other features
– Only NUMA remote memory; no local memory contention, no single-socket systems
– Severity cannot be quantified

SLIDE 8

Related Work - NUMA

  • Show the location of allocation, first touch, and use of data [Liu et al. PPoPP 2014]
– No quantification of imbalance
  • Visual detection of imbalance [Gimenez et al. SC 2014] [Gimenez et al. TVCG 2017] [Trahay et al. ICPP 2018]
  • OS extension with imbalance metric [Fedorova et al. ASPLOS 2013]
– Standard deviation of load across nodes

SLIDE 9

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 10

Relation of Bandwidth And Latency

  • Known in queuing theory
  • Application to DRAM

[Diagram: queuing model of DRAM access, with an analogy to customers arriving at a store. When arrivals are frequent, latency is the queuing delay plus the DRAM processing time (the queuing delay plus the in-store processing time). When arrivals are sparse, there is no queuing delay, and latency is only the DRAM processing time (only the in-store processing time).]
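The queuing effect can be made concrete with a toy simulation (a minimal sketch, not from the talk; the arrival and service rates are illustrative assumptions): as the offered load approaches the service capacity, the time a request spends in the system grows sharply, which is why latency, unlike consumed bandwidth, exposes contention.

    // Toy M/M/1 queue: one server ("DRAM") with exponential
    // interarrival and service times. Latency = queuing delay +
    // processing time, exactly as in the diagram above.
    #include <algorithm>
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(42);
        const double service_rate = 1.0; // requests served per time unit
        for (double util : {0.2, 0.5, 0.8, 0.95}) { // load relative to capacity
            std::exponential_distribution<double> interarrival(util * service_rate);
            std::exponential_distribution<double> service(service_rate);
            double clock = 0.0, server_free = 0.0, total_latency = 0.0;
            const int n = 200000;
            for (int i = 0; i < n; ++i) {
                clock += interarrival(rng);                  // next request arrives
                double start = std::max(clock, server_free); // wait if server busy
                server_free = start + service(rng);          // processing time
                total_latency += server_free - clock;        // delay + processing
            }
            std::printf("utilization %.2f -> mean latency %.2f\n",
                        util, total_latency / n);
        }
    }

With a mean service time of 1, the simulated latency stays near 1 at 20% load but grows to roughly 20 at 95% load, mirroring the earlier point that using 95% of the bandwidth can be either harmless or a severe contention problem.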

SLIDE 11

Latency of an Application

ID  IP  Data Address  Latency  Memory Level  TLB   Locked
–   a   aa            50       L2            hit   No
1   b   bb            330      DRAM          miss  No
2   c   cc            600      DRAM          hit   Yes
3   d   dd            300      DRAM          hit   No
4   e   ee            290      DRAM          hit   No

  • Application latency measurement with instruction sampling
  • Selected samples (filtered as sketched below):
– Exclude cache hits
– Exclude TLB misses
– Exclude atomic accesses
– Average latency of at least 25 samples
– Precise attribution

  • C. Helm and K. Taura, On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency, HPC Asia 2020
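The selection step maps directly to a small filter over the sampled records. A minimal sketch under assumed field names (this is not the PerfMemPlus source):

    // Filtering latency samples as the slide describes. The record
    // layout is an illustrative assumption.
    #include <cstdio>
    #include <optional>
    #include <vector>

    struct Sample {
        double latency; // cycles from issue to data return
        bool from_dram; // memory level resolved to DRAM (not a cache hit)
        bool tlb_miss;  // access also missed the TLB
        bool locked;    // atomic / locked access
    };

    // Average DRAM latency, or nothing if fewer than 25 samples remain.
    std::optional<double> dram_latency(const std::vector<Sample>& samples) {
        double sum = 0.0;
        int n = 0;
        for (const Sample& s : samples) {
            if (!s.from_dram) continue; // exclude cache hits
            if (s.tlb_miss) continue;   // page walk inflates latency
            if (s.locked) continue;     // exclude atomic accesses
            sum += s.latency;
            ++n;
        }
        if (n < 25) return std::nullopt; // too few samples for a stable average
        return sum / n;
    }

    int main() {
        // The five samples from the table above; only d and e survive the
        // filters, so the 25-sample threshold is not met here.
        std::vector<Sample> s = {{50, false, false, false}, {330, true, true, false},
                                 {600, true, false, true},  {300, true, false, false},
                                 {290, true, false, false}};
        if (auto l = dram_latency(s)) std::printf("avg DRAM latency: %.1f\n", *l);
        else std::puts("fewer than 25 usable samples");
    }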

SLIDE 12

Relative Latency Metric

  • System latency
– Latency of an uncontended DRAM access
– Determined with a pointer-chasing benchmark (see the sketch below)
  • Relative latency = application latency / system latency
  • Hardware-independent measure of the severity of DRAM contention
  • A value higher than one indicates a contention problem
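A minimal pointer-chasing sketch (an assumed setup; the talk's own benchmark may differ in details): a random cycle is walked so that each load depends on the previous one, defeating prefetching and request overlap, so the time per step approximates the uncontended DRAM latency.

    // Pointer chasing over a random cycle much larger than the L3 cache.
    // Buffer size and step count are illustrative assumptions.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = 1ull << 25; // ~256 MB of size_t cells
        std::vector<size_t> next(n), order(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), std::mt19937_64(1));
        for (size_t i = 0; i < n; ++i) // link cells into one random cycle
            next[order[i]] = order[(i + 1) % n];

        const size_t steps = 10'000'000;
        size_t p = order[0];
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < steps; ++i) p = next[p]; // serialized loads
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        // Printing p keeps the chase from being optimized away.
        std::printf("uncontended latency: ~%.1f ns per access (chk %zu)\n", ns, p);
    }

Dividing the sampled application latency by this baseline gives the relative latency; a value of 2 means DRAM accesses take twice as long as on an otherwise idle memory system.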
SLIDE 13

NUMA Imbalance Metric

  • Local ratio (per node) = number of local accesses / number of total DRAM accesses
  • NUMA imbalance = max(local ratio) - min(local ratio), computed as in the sketch below
  • 1 → high imbalance
  • 0 → low imbalance

[Table: example samples with columns ID, CPU ID (the node the access originated from), and Memory Level (local or remote DRAM access).]
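A minimal sketch of the metric computation (the field names and sample values are illustrative assumptions):

    // Per-node local ratio and the NUMA imbalance metric from memory samples.
    #include <algorithm>
    #include <cstdio>
    #include <map>

    struct MemSample { int node; bool local; }; // origin node, local/remote DRAM

    int main() {
        MemSample samples[] = {{0, true}, {0, true}, {0, true}, {0, true},
                               {1, true}, {1, false}, {1, false}, {1, false}};
        std::map<int, std::pair<int, int>> perNode; // node -> {local, total}
        for (const MemSample& s : samples) {
            auto& [local, total] = perNode[s.node];
            if (s.local) ++local;
            ++total;
        }
        double lo = 1.0, hi = 0.0;
        for (const auto& [node, counts] : perNode) {
            double ratio = double(counts.first) / counts.second; // local ratio
            std::printf("node %d: local ratio %.2f\n", node, ratio);
            lo = std::min(lo, ratio);
            hi = std::max(hi, ratio);
        }
        // 0 -> every node equally local; 1 -> highly imbalanced
        std::printf("NUMA imbalance: %.2f\n", hi - lo);
    }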

SLIDE 14

Profiling Tool Implementation

PerfMemPlus is available online: https://github.com/helchr/PerfMemPlus

[Diagram: tool components. Run Script, Linux Perf, Allocation Tracker, Profiled Application, Data Merger, SQLite Database, Analyzer.]
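Since the merged profile ends up in an SQLite database, analyses can be expressed as plain SQL. A sketch of such a query from C++; the table and column names here are hypothetical, not the actual PerfMemPlus schema:

    // Pull averaged DRAM latencies per instruction from the profile database.
    // The "samples" schema is a hypothetical stand-in.
    #include <cstdio>
    #include <sqlite3.h>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("profile.db", &db) != SQLITE_OK) return 1;
        const char* q =
            "SELECT ip, AVG(latency), COUNT(*) FROM samples "
            "WHERE memory_level = 'DRAM' "
            "GROUP BY ip HAVING COUNT(*) >= 25 ORDER BY AVG(latency) DESC";
        sqlite3_stmt* stmt = nullptr;
        if (sqlite3_prepare_v2(db, q, -1, &stmt, nullptr) == SQLITE_OK) {
            while (sqlite3_step(stmt) == SQLITE_ROW)
                std::printf("ip=0x%llx avg latency=%.1f (%d samples)\n",
                            (unsigned long long)sqlite3_column_int64(stmt, 0),
                            sqlite3_column_double(stmt, 1),
                            sqlite3_column_int(stmt, 2));
            sqlite3_finalize(stmt);
        }
        sqlite3_close(db);
    }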

SLIDE 15

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 16

Hardware Setup

Name      Architecture  CPUs           DRAM Bandwidth
Arcturus  Broadwell     2x E5-2699v4   43 GB/s
Comet     Haswell       2x E5-2699v3   32 GB/s
Rigel     Skylake       2x Xeon 8176   77 GB/s
Spica     Broadwell     4x E7-8890v4   25 GB/s

SLIDE 17

Experiment Design

  • A defined amount of contention to compare with the detection results
– A benchmark that creates an adjustable amount of contention
– A quantification of the severity of contention

SLIDE 18

Adjustable Contention Benchmarks

  • Simple memory-intensive parallel vector operations [Xu et al. IPDPS 2017]
– Countv, Dotv, Sumv
  • Two data sizes
– Smaller than the L3 cache: the optimal case, where DRAM bandwidth is no limitation
– Larger than the L3 cache: the DRAM bandwidth limitation has an impact
  • Variable number of threads
– More threads increase the bandwidth requirement (see the sketch after this list)
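A minimal OpenMP sketch in the spirit of these benchmarks (array sizes and the setup are illustrative assumptions, not the original code):

    // Adjustable contention benchmark in the style of sumv: a parallel
    // vector sum whose array size and thread count are varied.
    #include <cstdio>
    #include <cstdlib>
    #include <omp.h>
    #include <vector>

    int main(int argc, char** argv) {
        // e.g. 1<<16 floats stays in cache; 1<<28 floats exceeds any L3
        size_t n = argc > 1 ? std::strtoull(argv[1], nullptr, 10) : (1ull << 28);
        int threads = argc > 2 ? std::atoi(argv[2]) : omp_get_max_threads();
        std::vector<float> v(n, 1.0f);

        double t0 = omp_get_wtime();
        double sum = 0.0;
        // Every extra thread adds another memory stream, raising demand.
        #pragma omp parallel for reduction(+ : sum) num_threads(threads)
        for (size_t i = 0; i < n; ++i) sum += v[i];
        double t = omp_get_wtime() - t0;

        std::printf("sum=%.0f, %d threads, %.2f GB/s\n",
                    sum, threads, n * sizeof(float) / t / 1e9);
    }

Compiled with g++ -O2 -fopenmp, the small size keeps the working set in cache as threads grow, while the large size makes each added thread raise the DRAM bandwidth demand.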

SLIDE 19

Contention Quantifjcation

  • Speedup loss = speedup(small array version) / speedup(large array version)
  • Expresses the severity of the DRAM contention
  • For example, if the cache-resident version scales to 16x on 16 threads but the large version only reaches 4x, the speedup loss is 4
SLIDE 20

Detection Results

  • Each experiment is repeated 10 times
  • The percentage of correct detection is recorded

[Chart: percentage of correct detections, grouped by the upper boundary of the speedup loss interval.]

SLIDE 21

Advantage of Latency over Bandwidth

  • Compare information from three sources

– Bandwidth
– Latency
– Speedup loss

SLIDE 22

Information From Bandwidth

All benchmarks suffer from limited DRAM bandwidth

SLIDE 23

Information From Latency

DRAM access latency in uncontended state

SLIDE 24

Information From Latency

DRAM contention problem differs between benchmarks

SLIDE 25

True Information

DRAM contention problem differs between benchmarks

SLIDE 26

Contents

  • Introduction
  • Related Work
  • Contention Detection Method
  • Evaluation
  • Case Studies
  • Conclusion
SLIDE 27

Applications

  • All 13 PARSEC benchmarks
  • N3LP

– Neural machine translation [Eriguchi et al. WAT 2016] [https://github.com/hassyGo/N3LP]
– Implemented using the Eigen library [http://eigen.tuxfamily.org]

SLIDE 28

Bandwidth Contention Details

[Chart: relative latency (1 to 5) of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica, with the contention threshold marked.]

SLIDE 29

NUMA Imbalance

[Chart: NUMA imbalance (0.2 to 1) of streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica; annotations separate a small NUMA imbalance problem from a clear NUMA imbalance problem.]

SLIDE 30

Interleaved Allocation Speedup

[Chart: speedup in % (10 to 70) from interleaved allocation for streamcluster, canneal, and n3lp on arcturus, comet, rigel, and spica.]

  • High relative latency, high NUMA imbalance, interleaved allocation
➔ Large speedup
  • High relative latency only on Spica, high NUMA imbalance on all systems
➔ Speedup only on Spica
  • High relative latency, low NUMA imbalance
➔ No or low speedup

SLIDE 31

Profiling Overhead (PARSEC)

SLIDE 32

Conclusion

A new method to identify DRAM contention

Relative Latency  NUMA Imbalance  Performance Problem
Low               Any             No DRAM contention problem
High              Low             Contention, but not NUMA related
High              High            Contention due to inefficient NUMA usage
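The table reads as a two-step decision. A minimal sketch in code (relative latency above one is the talk's contention indicator; the imbalance threshold is an illustrative assumption):

    // The conclusion's decision table as a tiny classifier.
    #include <cstdio>
    #include <string>

    std::string diagnose(double relativeLatency, double numaImbalance) {
        const double latencyThreshold = 1.0;   // above uncontended system latency
        const double imbalanceThreshold = 0.5; // assumed low/high cut
        if (relativeLatency <= latencyThreshold) return "No DRAM contention problem";
        if (numaImbalance <= imbalanceThreshold) return "Contention, but not NUMA related";
        return "Contention due to inefficient NUMA usage";
    }

    int main() {
        std::printf("%s\n", diagnose(0.9, 0.8).c_str()); // low latency, any imbalance
        std::printf("%s\n", diagnose(2.5, 0.1).c_str()); // contention, balanced NUMA
        std::printf("%s\n", diagnose(3.0, 0.9).c_str()); // contention from NUMA usage
    }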

  • Future work
– Include more hardware-related causes of DRAM contention