GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications – PowerPoint PPT Presentation



SLIDE 1

GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications

Reena Nair Tony Field

SLIDE 2
What causes serialization bottlenecks?

  • Resource Contention

– Hardware: CPU, peripherals
– Software: locks

  • Load Imbalance

Fig: Execution time per thread ID, illustrating load imbalance.

SLIDE 3

Serialization Bottlenecks – Reduced Parallelism

Fig: Four threads (Thread1–Thread4) on four cores over time; before a barrier all four run (max parallelism), while threads blocked at the barrier reduce parallelism.

SLIDE 4
  • There are many different sources of bottlenecks.

Fig: Different profilers (A–D) each target a single bottleneck source: locks, memory, critical threads, or peripherals.

Profilers for debugging performance issues

SLIDE 5
  • Can identify several different types of serialization bottlenecks.
  • No need to instrument the application.
  • Validated on multithreaded and multi-process parallel applications written in C/C++.
  • Implemented using the extended Berkeley Packet Filter (eBPF).

– Provides fast and secure kernel tracing (~4% average runtime overhead).

GAPP – Generic Automatic Parallel Profiler

SLIDE 6

Harness the symptom rather than the cause

  • Identify when and where reduced parallelism is exhibited.

– Number of active threads Nact <= Nmin, a tuneable threshold with a default value of N/2, where N is the total number of threads.

  • Trace context-switch events in the kernel.

– Retrieve the stack trace at the end of a time slice.

  • Reduce overhead – retrieve stack traces (ST) only from critical time slices.
  • A critical time slice is one whose average active thread count is <= Nmin; traces from non-critical slices, such as ST2, are omitted.

Fig: Four threads on four cores around a barrier; per-slice active-thread counts (1–4) determine which stack traces (ST2, ST3) are critical.
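GAPP implements this filtering inside the kernel with eBPF; purely as an illustration of the thresholding logic, here is a plain-Python sketch (the `TimeSlice` fields and example numbers are assumptions, not GAPP's actual data layout):

```python
from dataclasses import dataclass

N_TOTAL = 4           # total threads in the application (N)
N_MIN = N_TOTAL // 2  # tuneable threshold Nmin, default N/2

@dataclass
class TimeSlice:
    thread_id: int
    duration_ns: int
    avg_active_threads: float  # average Nact over the slice
    stack_trace: tuple         # retrieved at the context switch

def is_critical(ts: TimeSlice, n_min: int = N_MIN) -> bool:
    """A time slice is critical when its average active thread
    count indicates reduced parallelism."""
    return ts.avg_active_threads <= n_min

# Keep stack traces only from critical slices to reduce overhead.
slices = [
    TimeSlice(1, 5_000, 1.5, ("ST3",)),  # reduced parallelism
    TimeSlice(2, 5_000, 4.0, ("ST2",)),  # all threads active
]
critical = [ts for ts in slices if is_critical(ts)]  # ST2 dropped
```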

SLIDE 7
  • Stack traces retrieved at the end of a time slice point to bottleneck code only if that code happened to be executing at the end of the slice.

Are stack traces enough to identify bottlenecks?

Fig: A bottleneck executing mid-slice is missed, because only the stack traces captured at slice boundaries (ST1–ST4) are recorded.

SLIDE 8

Fig: Periodic instruction-pointer samples (IP1–IP13) taken on four cores are grouped with the stack traces (ST1–ST4) of their time slices; samples from non-critical slices are rejected.

  • Periodically sample instruction pointers.
  • Reject samples taken when Nact > Nmin.
  • Combine the instruction pointers and stack traces of each critical time slice.
  • Each critical time slice is assigned a metric, Criticality Metric1 (Cmetric), which takes into account the duration and degree of parallelism of the time slice.

1Du Bois, Kristof, et al. "Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior", ISCA '13

Combining bottleneck code and call paths

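Following Du Bois et al.'s criticality stacks, a slice's contribution can be modelled as its duration divided by the number of concurrently active threads; a minimal sketch of that idea (the function name and the numbers are illustrative, not GAPP's exact formula):

```python
def cmetric(duration_ns: int, active_threads: int) -> float:
    """Per-slice criticality contribution: long slices during which
    few threads are active (low parallelism) score highest."""
    return duration_ns / active_threads

# A 10 ms slice with one active thread is four times as critical
# as the same slice with all four threads active.
assert cmetric(10_000_000, 1) == 4 * cmetric(10_000_000, 4)
```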

SLIDE 9

Ranking Bottlenecks

  • Similar call paths, their samples and Cmetric values are combined and sorted to display potential critical call paths, functions and lines of code, and the Cmetric of individual threads.

ThreadID  CMetric
25778     256130902
25779     417320962
25783     5003332502
25784     5003756997   (reveals load imbalance, if any)

Critical Path 1:
deflate_slow() <--- deflate() <--- compress() <--- Compress()

Functions and lines + Frequency:
deflate_slow -- 1465
  deflate.c:1650 (StackTop) -- 575
  deflate.c:1580 -- 354   (optimization opportunities)
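The ranking step can be sketched as a simple aggregation (the call paths and Cmetric values here are made up for illustration, and this sketch merges only identical paths, whereas GAPP combines similar ones):

```python
from collections import defaultdict

# Hypothetical per-slice records: (call path, Cmetric of the slice)
records = [
    (("Compress", "compress", "deflate", "deflate_slow"), 350.0),
    (("Compress", "compress", "deflate", "deflate_slow"), 225.0),
    (("main", "read_input"), 40.0),
]

# Combine matching call paths and sort by accumulated Cmetric.
totals = defaultdict(float)
for path, cm in records:
    totals[path] += cm

ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
# The deflate_slow path accumulates 575.0 and ranks first.
```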

SLIDE 10

GAPP - Evaluation

  • Evaluated using applications from the Parsec-3.0 benchmark suite and two large open-source projects, MySQL and Nektar++.
  • All applications except Nektar++ were multithreaded; each was executed with 64 threads.
  • Nektar++, a spectral/hp element framework which uses message passing, was executed with 16 MPI processes.

SLIDE 11

Load imbalance from thread CMetric

Multithreaded Task Parallel Application - Ferret

  • Six pipeline stages - first and last stages perform I/O with single threads.

Fig: Ferret pipeline stages with initial thread allocation – Load (1) → Segmentation (15) → Feature extraction (15) → Indexing (15) → Ranking (15) → Out (1)

Fig: GAPP profile for Ferret

Critical Path 1:
emd() <--- sdist_emd() <--- raw_query() <--- cass_table_query() <--- t_rank() <--- start_thread()

Functions and lines + Frequency:
isOptimal -- 41314
  emd.c:422 -- 20813
  emd.c:423 -- 10760
  emd.c:420 -- 6657
findBasicVariables -- 41301
  emd.c:350 -- 7366
  emd.c:353 -- 6713
  emd.c:383 -- 5827

SLIDE 12

Optimizing Ferret by thread reallocation

Thread allocation (Seg-Extract-Index-Rank)   Run time
15-15-15-15 (initial)                        30s
15-5-15-25                                   20s
2-1-18-39 (after optimization)               15s

  • Ranking phase exhibited higher CMetric when compared to other stages.
  • Optimized by re-allocating threads to ranking phase.

Fig: Cmetric for different thread allocations - Ferret
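One way to picture the reallocation is a split proportional to per-stage Cmetric. This is only a sketch under assumed numbers – the allocations on the slide were chosen by hand, and the stage names and Cmetric values below are invented:

```python
# Hypothetical per-stage Cmetric values for Ferret's four middle
# stages (names and numbers are illustrative).
cmetrics = {"segment": 1.0, "extract": 0.5, "index": 1.5, "rank": 5.0}
TOTAL = 60  # threads available to the four stages

# Give each stage a thread share proportional to its Cmetric,
# with at least one thread per stage.
total_cm = sum(cmetrics.values())
alloc = {stage: max(1, round(TOTAL * cm / total_cm))
         for stage, cm in cmetrics.items()}
# The ranking stage, with the highest Cmetric, gets the most threads.
```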

SLIDE 13

Resource Contention – MySQL

Sysbench OLTP_read_write workload

Critical Path 1 (disk I/O):
fil_flush() [mysqld] <--- log_write_up_to() <--- trx_commit_complete_for_mysql() <--- innobase_commit() <--- ha_commit_low() <--- TC_LOG_DUMMY::commit() <--- ha_commit_trans() <--- trans_commit() <--- mysql_execute_command() <--- Prepared_statement::execute()

Functions and lines + frequency:
pfs_os_file_flush_func -- 1462
  os0file.ic:507 (StackTop) -- 1462

Critical Path 2 (spin-wait loop):
sync_array_reserve_cell() <--- rw_lock_s_lock_spin() <--- pfs_rw_lock_s_lock_func() <--- row_search_mvcc() <--- ha_innobase::index_read() <--- handler::ha_index_read_idx_map() <--- join_read_const_table() <--- JOIN::extract_func_dependent_tables() <--- JOIN::make_join_plan() <--- JOIN::optimize()

Functions and lines + frequency:
sync_array_reserve_cell() -- 469
  sync0arr.cc:389 (StackTop) -- 469

SLIDE 14

Optimizing MySQL

  • pfs_os_file_flush_func() – Critical Function 1 (hardware resource contention)

– Invoked by InnoDB; flushes write buffers to disk.
– Increasing the buffer size improved transaction rate by 19% and reduced latency by 16%.

  • sync_array_reserve_cell() – Critical Function 2 (software resource contention)

– Invoked from a custom-built spin lock that blocks after spinning for a predefined time.
– Increasing the spin-wait time reduced cache misses by 10.6%.

These two modifications cumulatively improved query transaction rate by 34% and reduced average latency by 25%.

SLIDE 15

Bodytrack – Parsec3.0

Multithreaded application that follows the producer-consumer paradigm.

Fig: Bodytrack's main producer loop reads the next set of images from a queue (fed by an AsyncIO thread), sends commands to a pool of worker threads (Update, Estimate), and delegates output (WritePose, OutputBMP) to a writer thread.

Critical Call Path 1:
FlexDownSample2() <--- TrackingModel::OutputBMP() <--- mainPthreads() <--- main()

Optimizing this path improved performance by 22%.

SLIDE 16

GAPP on MPI Applications


  • Nektar++ – a spectral/hp element framework that implements several PDE solvers.
  • Evaluated using the Incompressible Navier-Stokes Solver with 16 MPI processes.
  • Load imbalance was found to be due to non-uniform partitioning of the mesh.

Fig: Normalised Cmetric of individual MPI processes (Task IDs 1–16)

SLIDE 17

GAPP Profile - Nektar ++

Critical Path 1:
__GI___poll() [libc-2.27.so] <--- MPIDI_CH3I_Progress() [libmpi.so.12.1.1] <--- MPIC_Wait() [libmpi.so.12.1.1] <--- MPIC_Recv() [libmpi.so.12.1.1] <--- MPIR_Bcast_binomial() [libmpi.so.12.1.1] <--- MPIR_Bcast_intra() [libmpi.so.12.1.1] <--- MPIR_Bcast() [libmpi.so.12.1.1] <--- MPIR_Bcast_impl() [libmpi.so.12.1.1] <--- MPIR_Allreduce_intra() [libmpi.so.12.1.1] <--- MPIR_Allreduce_impl() [libmpi.so.12.1.1]

Functions and lines + Frequency (for each critical path):
dgemv_() [libblas.so.3.8.0] -- 594
double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 116
gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 58

Top critical functions and lines + Frequency (combining functions and lines from all critical paths):
dgemv_() [libblas.so.3.8.0] -- 781
double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 170
gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 100

SLIDE 18

Optimizing critical functions – Nektar++

Fig: Sample counts of the top critical functions before optimization – the bottleneck function dgemv (F1) dominates.

  • Bottleneck function – a matrix-vector multiplication routine (dgemv) exported by the BLAS library.
  • Replacing the default BLAS library with OpenBLAS improved run time by 27%.

Fig: Sample counts of the top critical functions after optimization – dgemv no longer dominates.

SLIDE 19

Conclusion

  • GAPP was able to identify different types of serialization bottlenecks in different classes of applications.

  • Robust

– Consistent results across multiple runs under the same test condition.

  • Customizable

– Tuneable parameters: Nmin, sampling frequency, stack depth, and an option to include results from dynamic libraries.

  • Limitation

– Will not detect spin-wait loops that do not block.

  • Available at

– https://github.com/RN-dev-repo/GAPP

SLIDE 20

Thank You