Martin Schulz, Lawrence Livermore National Laboratory, VAPLS 2013



SLIDE 1

VAPLS 2013, Atlanta  October 14th, 2013

LLNL-PRES-xxxxxx

Alfredo Gimenez, University of California at Davis
Martin Schulz, Lawrence Livermore National Laboratory

SLIDE 2

Working in the Application Domain Alfredo Gimenez and Martin Schulz

Single view on data is insufficient

  • Different patterns emerge in different domains

  • Patterns help identify performance problems

Map data from one domain to one of the other domains

  • Comparable data
  • Enable correlation
  • Understand interactions
  • Access to visualization techniques

Application Domain

  • Intuitive to the application scientist
  • Can employ similar scientific visualization techniques

[Figure: three domains: physical layout of hardware, communication patterns, physical simulation data]

SLIDE 3

[Figure panel: Temperature]

Example: 256-core run of a CFD application, MIRANDA

  • Floating point operations

Simple step:

  • Map floating point ops onto the application domain
  • Similarly for L2 cache misses

Apparent correlations

  • Explains performance
  • Application-specific bottlenecks
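
The mapping step on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: it assumes one counter value per MPI rank and a hypothetical row-major block decomposition of a 2D domain into `px` by `py` subdomains. The names `map_to_domain` and `pearson` and all counter values are invented for the example.

```python
from math import sqrt


def map_to_domain(per_rank_values, px, py):
    """Place one counter value per rank into a py-by-px grid of subdomains.

    Assumption: rank r owns the subdomain at row r // px, column r % px.
    A real code would query the application for its actual decomposition.
    """
    if len(per_rank_values) != px * py:
        raise ValueError("expected exactly one value per rank")
    grid = [[0.0] * px for _ in range(py)]
    for rank, value in enumerate(per_rank_values):
        row, col = divmod(rank, px)
        grid[row][col] = value
    return grid


def pearson(xs, ys):
    """Pearson correlation between two equally long per-rank series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up counter values for a 3x2 decomposition (six ranks).
fp_ops = [10, 20, 30, 40, 50, 60]
l2_misses = [1, 2, 3, 4, 5, 6]
fp_grid = map_to_domain(fp_ops, px=3, py=2)
l2_grid = map_to_domain(l2_misses, px=3, py=2)
correlation = pearson(fp_ops, l2_misses)
```

Once both counters sit on the same grid, they can be rendered with the same visualization used for the simulation fields, and a simple statistic such as the correlation above gives a first hint of whether two metrics move together.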

[Figure panels: Floating Point, L2CM]

SLIDE 4

[Figure panels: Aluminum, Velocity, L1CM, Time, FP Ops, BranchMiss]

SLIDE 5

L1CM

  • Observation: one core per node consistently creates more L1 misses
  • Caused by the execution of collective MPI operations
  • Shows the need for different perspectives to disambiguate causes
  • Feature detection and correlation can automate this process

[Figure panels: App Domain; HW Domain: 16 nodes with 4x4 cores]
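
The observation above (one core per node standing out) is the kind of feature that can be detected automatically once the data is viewed in the hardware domain. The following is a rough sketch under stated assumptions: ranks are assumed to be placed consecutively on nodes (rank `r` on node `r // cores_per_node`), which is hypothetical and depends on the job launcher; `hottest_core_per_node` and the sample values are invented for illustration.

```python
from statistics import median


def hottest_core_per_node(l1_misses, cores_per_node=16):
    """For each node, report the core with the most L1 misses.

    Returns a list of (node, core, ratio) tuples, where ratio is the
    core's miss count divided by the median miss count on that node.
    Assumption: ranks are laid out consecutively, node by node.
    """
    results = []
    for start in range(0, len(l1_misses), cores_per_node):
        node_values = l1_misses[start:start + cores_per_node]
        core = max(range(len(node_values)), key=node_values.__getitem__)
        base = median(node_values)
        ratio = node_values[core] / base if base else float("inf")
        results.append((start // cores_per_node, core, ratio))
    return results


# Made-up data: two nodes with four cores each, one outlier per node.
sample = [10, 10, 50, 10, 10, 40, 10, 10]
outliers = hottest_core_per_node(sample, cores_per_node=4)
```

A ratio well above 1 on every node would point to a per-node effect such as the MPI collective handling named on the slide, rather than to a property of the simulated physics.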

SLIDE 6

[Figure panels: L1 Cache Misses; FP Operations; Same data with linear color map]

SLIDE 7

[Figure panels: L1 Cache Misses with MPI worker filtered; FP Operations; Same data with linear color map]

SLIDE 8

[Figure panels: L1 Cache Misses with MPI worker filtered; FP Operations; Same data with linear color map; L1 Misses per FP operation: Proxy for efficiency]
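
The derived metric on this slide, L1 misses per floating point operation with the MPI worker cores filtered out, can be computed as below. This is a sketch only: how the MPI worker ranks are identified is not stated on the slides, so the `exclude` set is an assumed input, and `misses_per_fp_op` and the numbers are hypothetical.

```python
def misses_per_fp_op(l1_misses, fp_ops, exclude=()):
    """Return L1 misses per FP operation for each rank.

    Ranks listed in `exclude` (assumed to be the MPI worker cores) and
    ranks with zero FP operations yield None, so they can be masked
    out in a plot instead of distorting the color scale.
    """
    excluded = set(exclude)
    ratios = []
    for rank, (misses, ops) in enumerate(zip(l1_misses, fp_ops)):
        if rank in excluded or ops == 0:
            ratios.append(None)
        else:
            ratios.append(misses / ops)
    return ratios


# Made-up counters for three ranks; rank 0 is treated as an MPI worker.
ratios = misses_per_fp_op([10, 20, 30], [100, 0, 60], exclude={0})
```

Lower values mean more arithmetic per cache miss, which is why the slide treats the ratio as a proxy for efficiency.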

SLIDE 9

Previous example has coarse granularity

  • MIRANDA example uses per-core performance data
  • Each core is responsible for a portion of the application domain
  • Need finer-grained data and more general mapping techniques

Question: can we get access to finer-grained data?

  • Ideal: per data point measurements
      - Hard to track in hardware (in all details)
      - Hardware simulation has high overhead and is most probably inaccurate
  • Approach: exploit new hardware sampling techniques and develop mechanisms to provide application domain mappings

SLIDE 10

Target application: LULESH

  • Shock Hydrodynamics challenge problem
  • Unstructured hex mesh
  • Implemented in a wide range of models

PEBS (Precise Event-Based Sampling) counters

  • Sampling of memory loads
  • Load address, time to load, cache hierarchy
  • Enables mapping back to data structure
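
Mapping a sampled load back to a data structure amounts to an address-range lookup. The sketch below assumes each sample is an `(address, latency_cycles)` pair and that the base address, element size, and length of the array of interest are known; how these are obtained is not described on the slides, and `attribute_samples` is a hypothetical name.

```python
def attribute_samples(samples, base, elem_size, n_elems):
    """Accumulate sampled load latency per array element.

    samples: iterable of (address, latency_cycles) pairs.
    base, elem_size, n_elems: assumed layout of one contiguous array.
    Returns (total_latency, sample_count), each a list indexed by element.
    Samples that fall outside the array are ignored.
    """
    totals = [0] * n_elems
    counts = [0] * n_elems
    end = base + elem_size * n_elems
    for address, latency in samples:
        if base <= address < end:
            index = (address - base) // elem_size
            totals[index] += latency
            counts[index] += 1
    return totals, counts


# Made-up samples against a four-element array of 8-byte values at 1000.
samples = [(1000, 5), (1008, 7), (1015, 3), (2000, 9)]
totals, counts = attribute_samples(samples, base=1000, elem_size=8, n_elems=4)
```

Because each array index corresponds to a mesh element, the per-element totals can then be drawn directly on the mesh in the application domain.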

Experiments

  • OpenMP version of LULESH
  • 4-core Intel Ivy Bridge

[Figure panels: Calculation step: 2051; Calculation step: 616]

SLIDE 11

Total Cycles

Compulsory cache misses at first element

SLIDE 12

Total Cycles

Cache misses due to thread-level parallelism

SLIDE 13

Performance data can be measured in many domains

  • Need to correlate domains
  • Need visualization techniques in each domain

Application domain

  • Intuitive for the user
  • Exploit existing tools

Challenges moving forward

  • Automatic analysis within different domains (e.g. feature detection)
  • Emerging domains: multivariate, infoviz