SLIDE 1

Leveraging Hardware Address Sampling: Beyond Data Collection and Attribution

Xu Liu
Department of Computer Science, College of William and Mary
xl10@cs.wm.edu

SLIDE 2

Motivation: Memory is the Bottleneck

[Figure: NUMA (Non-Uniform Memory Access): two sockets, each with cores, a cache, and local memory, connected by QuickPath/HyperTransport; memory accesses are either local or remote.]

SLIDE 3

Memory Bottleneck Optimization

[Figure: three targets of memory bottleneck optimization, all rooted in cache misses: (1) NUMA locality, (2) temporal locality, (3) spatial locality.]

SLIDE 4

State of the Art

[Figure: simulation methods offer deep insights; measurement methods offer low overhead; the goal is deep insights with low overhead.]

Weaknesses of simulation methods:

  • 2-5x overhead
  • not real machines

SLIDE 5

Hardware Address Sampling

  • Features of address sampling

– necessary features

  • sample memory-related events (memory accesses, NUMA events)
  • capture effective addresses
  • record precise IP of sampled instructions or events

– optional features

  • record useful metrics: data access latency (in CPU cycles)
  • sample instructions/events not related to memory
  • Support in modern processors
  • AMD Opteron 10h and above: instruction-based sampling (IBS)
  • IBM POWER 5 and above: marked event sampling (MRK)
  • Intel Itanium 2: data event address register sampling (DEAR)
  • Intel Pentium 4 and above: precise event based sampling (PEBS)
  • Intel Nehalem and above: PEBS with load latency (PEBS-LL)
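
These necessary features are exposed on Linux through the perf_event interface. The sketch below opens one sampling event that records a precise IP, the effective address, and the access latency per sample. It is a minimal illustration: the raw event code 0x01cd is an assumption, since the encoding of a precise memory-load event (PEBS-LL, IBS, MRK, DEAR) is CPU-specific.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open one address-sampling event for the calling thread on any CPU. */
int open_address_sampling(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x01cd;                    /* assumed memory-load event code */
    attr.sample_period = 10000;              /* one sample per ~10k events */
    attr.sample_type = PERF_SAMPLE_IP        /* precise instruction pointer */
                     | PERF_SAMPLE_TID
                     | PERF_SAMPLE_ADDR      /* effective data address */
                     | PERF_SAMPLE_WEIGHT;   /* access latency, where supported */
    attr.precise_ip = 2;                     /* request precise (PEBS-style) samples */
    attr.exclude_kernel = 1;
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}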


SLIDE 6

Tools Based on Address Sampling

  • Measurement methods

– temporal/spatial locality

  • HPCToolkit, Cache Scope

– NUMA locality

  • Memphis, MemProf, HPCToolkit
  • Features

– lightweight performance data collection
– efficient performance data attribution

  • code-centric attribution
  • data-centric attribution


Take HPCToolkit as an example:

“A Data-centric Profiler for Parallel Programs”. Liu and Mellor-Crummey, SC’13

SLIDE 7

HPCToolkit: Attributing Samples

[Figure: data-centric attribution maps a sample's effective address to a variable: static variables by name, heap-allocated variables by their allocation call path (through malloc) and address range (e.g., 0x0-0xff). Code-centric attribution maps the sample's IP to its calling context.]
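
A minimal sketch of the data-centric lookup, assuming a malloc interposer that records each allocation's address range and call path. The structure and helper names are illustrative, not HPCToolkit's actual internals, and a real tool would use a balanced tree or interval map rather than a linked list.

#include <stdint.h>

typedef struct alloc_info {
    uintptr_t start, end;         /* [start, end) of one heap allocation */
    const char *alloc_path;       /* call path captured at malloc time */
    struct alloc_info *next;
} alloc_info;

static alloc_info *live_allocs;   /* maintained by the malloc wrapper */

/* Attribute one sampled effective address to its data object. */
const char *attribute_sample(uintptr_t addr)
{
    for (alloc_info *a = live_allocs; a; a = a->next)
        if (addr >= a->start && addr < a->end)
            return a->alloc_path; /* data-centric: blame the allocation site */
    return "static/unknown";      /* fall back to static-symbol lookup */
}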

SLIDE 8


HPCToolkit: Aggregating Profiles

[Figure: per-thread profiles, each attributing heap-allocated variables to their allocation call paths through malloc, are merged into a single aggregate profile.]

SLIDE 9

LULESH on a Platform of 8 NUMA Domains

[Figure: HPCToolkit profile showing z's allocation call path and call site, plus the call paths of accesses to z.]

  • z is allocated in one NUMA domain but accessed by others
  • z accounts for 7.7% of remote accesses; heap data overall incurs 68% remote accesses
  • interleaving pages of z across NUMA nodes yields a 13% improvement in running time

SLIDE 10

Existing Measurement is Inadequate

  • Data collection + attribution ≠ optimal optimization

– know the problematic data objects, but not why they are problematic
– need more insights for optimization guidance
– challenges in data analysis

  • sampling does not monitor continuous memory accesses
  • Approaches: data analysis for detailed optimization guidance

– NUMA locality

  • offline optimization (PPoPP’14)
  • online optimization

– cache locality

  • array regrouping (PACT’14)
  • structure splitting
  • locality optimization between SMT threads

– scalability of memory accesses


SLIDE 11


Interleaved Allocation is NOT Always Best

[Figure: three data distributions across four NUMA domains. Centralized allocation: poor. Interleaved allocation: sub-optimal. Co-locating data with computation: optimal.]

Goal: identify the best data distribution for a program
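
A minimal sketch of the two contrasting placements using libnuma (link with -lnuma). The four-domain split and chunk sizing are illustrative assumptions; in practice each chunk must be page-aligned.

#include <numa.h>
#include <stddef.h>

/* Interleaved: pages of the array spread round-robin across all nodes. */
double *alloc_interleaved(size_t n)
{
    return numa_alloc_interleaved(n * sizeof(double));
}

/* Block-wise: bind each quarter of the array to the domain whose
   threads will compute on it, co-locating data with computation. */
double *alloc_blockwise(size_t n)
{
    size_t bytes = n * sizeof(double);
    double *a = numa_alloc(bytes);   /* allocate with the default policy */
    size_t chunk = bytes / 4;        /* 4 NUMA domains assumed */
    for (int node = 0; node < 4; node++)
        numa_tonode_memory((char *)a + node * chunk, chunk, node);
    return a;
}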

SLIDE 12


Memory Access Pattern Analysis

  • Online data collection

– for each thread, maintain the [min, max] interval of effective addresses sampled in each call path

  • Offline analysis

– merge [min, max] intervals along call paths
– plot [min, max] for each thread

  • works for any context and any variable

[Figure: per-thread [min, max] intervals plotted over array A's address range (0x00 to 0xff). Threads T1-T4 each touch a distinct segment, so allocating A block-wise across domains 1-4 gives balanced allocation plus maximum locality.]
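
A minimal sketch of this interval bookkeeping; the names interval, record_sample, and merge are illustrative.

#include <stdint.h>

typedef struct { uintptr_t min, max; } interval;  /* [min, max] footprint */

/* Online, per thread and per call path: fold one sampled effective
   address into the running interval (initialize min = UINTPTR_MAX, max = 0). */
static void record_sample(interval *iv, uintptr_t addr)
{
    if (addr < iv->min) iv->min = addr;
    if (addr > iv->max) iv->max = addr;
}

/* Offline: merge two intervals recorded for the same call path. */
static interval merge(interval a, interval b)
{
    interval m = { a.min < b.min ? a.min : b.min,
                   a.max > b.max ? a.max : b.max };
    return m;
}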

SLIDE 13


Pinpointing First Touch

  • Linux “first touch” policy

– physical pages are placed in the NUMA domain of the thread that first touches them
– if T1 first touches the whole range of A, all of A lands in T1's domain
– if threads touch different segments of A, A is distributed across their domains

  • Pinpoint “first touch”

– protect each variable's pages at allocation
– the first access to each variable traps, revealing which thread and call path touch it first

[Figure: a heap-allocated variable's pages (0x0 to 0xff) spread across domains 1-4, annotated with its allocation path and first-touch location.]
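
A minimal sketch of the trap mechanism using mprotect and a SIGSEGV handler. The function names are illustrative; a real profiler would record the faulting thread and its call path in the handler, and start/len must be page-aligned.

#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static void *watched_start;
static size_t watched_len;

/* SIGSEGV handler: si->si_addr is the first-touched address. Record it
   (thread, call path), then unprotect so the access can complete. */
static void on_first_touch(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)si; (void)ctx;
    mprotect(watched_start, watched_len, PROT_READ | PROT_WRITE);
}

/* Protect a variable's pages at allocation time so its first access traps. */
void watch_first_touch(void *start, size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_first_touch;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    watched_start = start;
    watched_len = len;
    mprotect(start, len, PROT_NONE);
}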

SLIDE 14


LULESH on a Platform of 8 NUMA Domains

[Figure: HPCToolkit view showing the call path that allocates z, the call paths that access z, and the call path and source code that first touch z, with special and common metrics broken down by NUMA domain.]

  • z accounts for 7.7% of remote accesses
  • interleaved allocation: 13% faster running time
  • block-wise allocation: 25% faster running time

SLIDE 15


Experiments: Architectures & Applications

Architectures:

  Sampling mechanism                    Processor              Threads
  Instruction-based sampling (IBS)      AMD Magny-Cours        48
  Marked event sampling (MRK)           IBM POWER7             128
  Precise event-based sampling (PEBS)   Intel Xeon Harpertown  8
  Data event address registers (DEAR)   Intel Itanium 2        8
  PEBS with load latency (PEBS-LL)      Intel Ivy Bridge       8

Benchmarks:

  LLNL: AMG2006, LULESH, Sphot, UMT2013
  LANL: Sweep3D
  SNL: S3D
  PARSEC: Streamcluster, Blackscholes
  Rodinia: NW

  • Optimized benchmarks

SLIDE 16


Optimization Results

  Program        Optimization             Improvement in execution time
  AMG2006        NUMA locality            51% (for the solver)
  Sweep3D        spatial locality         15%
  LULESH         spatial + NUMA locality  25%
  Streamcluster  NUMA locality            28%
  NW             NUMA locality            53%
  UMT2013        NUMA locality            7%

SLIDE 17


Measurement Overhead

NUMA analysis (code-, data-, and address-centric analysis plus first-touch pinpointing) on POWER7 and Opteron:

  Benchmark      Configuration        Running time (overhead)
  AMG2006        4 MPI x 128 threads  604s (+9.6%)
  Sweep3D        48 MPI               90s (+2.3%)
  LULESH         48 threads           19s (+12%)
  Streamcluster  128 threads          27s (+8.0%)
  NW             128 threads          80s (+3.9%)

Code- & data-centric analysis:

  Method   LULESH       AMG2006      Blackscholes
  IBS      295s (+24%)  89s (+37%)   192s (+6%)
  MRK      93s (+5%)    27s (+7%)    132s (+4%)
  PEBS     65s (+45%)   96s (+52%)   82s (+25%)
  DEAR     90s (+7%)    120s (+12%)  73s (+4%)
  PEBS-LL  35s (+6%)    57s (+8%)    67s (+3%)

SLIDE 18

Conclusions and Future Work

  • Hardware address sampling

– widely supported in modern architectures
– powerful for monitoring memory behavior
– currently at an early stage of study

  • focusing on data collection and attribution
  • Potentials of hardware address sampling

– provide deeper insights than traditional performance counters
– require novel analysis methods to expose performance insights


  • Future work

– integrating address sampling into the Charm++ runtime for online optimization
