
GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications



  1. GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications
     Reena Nair, Tony Field

  2. What causes serialization bottlenecks?
     • Resource Contention: hardware (CPU, peripherals) and software (locks)
     • Load Imbalance
     [Figure: execution time plotted against thread ID]

  3. Serialization Bottlenecks – Reduced Parallelism
     [Figure: timeline of Threads 1 to 4 on Cores 1 to 4 around a barrier, marking phases of maximum parallelism, reduced parallelism, and blocked threads]

  4. Profilers for debugging performance issues
     • There are many different sources of bottlenecks.
     [Figure: four separate profilers (A to D), each targeting one source: memory, locks, peripherals, critical threads]

  5. GAPP – Generic Automatic Parallel Profiler
     • Can identify several different types of serialization bottlenecks.
     • No need to instrument the application.
     • Validated on multithreaded and multi-process parallel applications written in C/C++.
     • Implemented using the extended Berkeley Packet Filter (eBPF).
       – Provides fast and secure kernel tracing (~4% average runtime overhead).

  6. Harness the symptom rather than the cause
     • Identify when and where reduced parallelism is exhibited.
       – Number of active threads N_act <= N_min, a tuneable threshold variable with a default value of N/2, where N is the total number of threads.
     • Trace context switch events in the kernel.
       – Retrieve the stack trace at the end of a time slice.
     • Reduce overhead: retrieve stack traces only from critical time slices.
     • Critical time slice: a time slice whose average active thread count is <= N_min.
     • Omit ST 2
     [Figure: Threads 1 to 4 on Cores 1 to 4 around a barrier, with the active thread count and the stack traces (ST) retrieved per time slice; the reduced-parallelism region is highlighted]
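The critical time-slice test on this slide can be illustrated with a small user-space sketch. This is not GAPP's eBPF code: the function names, the `(timestamp, new_active_count)` event format, and the time-weighted averaging are assumptions made for illustration. Only the rule "average active thread count <= N_min, with N_min defaulting to N/2" comes from the slide.

```python
def average_active_threads(changes, start, end, active_at_start):
    """Time-weighted average of the active thread count over [start, end).

    `changes` is a time-sorted list of (timestamp, new_active_count) pairs,
    a hypothetical stand-in for what context-switch tracing would report.
    """
    if end <= start:
        raise ValueError("time slice must have positive duration")
    weighted_sum = 0
    prev_time, active = start, active_at_start
    for timestamp, new_active in changes:
        weighted_sum += active * (timestamp - prev_time)
        prev_time, active = timestamp, new_active
    weighted_sum += active * (end - prev_time)
    return weighted_sum / (end - start)


def is_critical(avg_active, total_threads, n_min=None):
    """A slice is critical when its average active count is <= N_min."""
    if n_min is None:
        n_min = total_threads / 2  # default threshold stated on the slide
    return avg_active <= n_min


# Hypothetical slice: 4 threads active at t=0, 1 active from t=20, 2 from t=80.
avg = average_active_threads([(20, 1), (80, 2)], start=0, end=100, active_at_start=4)
critical = is_critical(avg, total_threads=4)
```

In this example the slice averages fewer than two active threads out of four, so it would be treated as critical and its stack trace kept.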

  7. Are stack traces enough to identify a bottleneck?
     • Stack traces retrieved at the end of a time slice point to the bottleneck code only if that code happened to be executing at the end of the time slice.
     [Figure: Threads 1 to 4 on Cores 1 to 4 with the active thread count per interval and stack traces ST 1 to ST 4; a region labelled "Missed Bottleneck?" falls between the points where stack traces are taken]

  8. Combining bottleneck code and call paths
     • Periodically sample instruction pointers (IPs).
     • Reject samples if N_act > N_min.
     • Combine the instruction pointers and stack traces of each critical time slice.
     • Each critical time slice is assigned a metric, the Criticality Metric [1] (CMetric), which takes into account the duration and degree of parallelism of the time slice.
     [Figure: periodic samples IP 1 to IP 13 taken from threads T1 to T4 on Cores 1 to 4, with the active thread count per interval; rejected samples are crossed out and the remaining samples are grouped under stack traces ST 1 to ST 4]
     [1] Du Bois, Kristof, et al. "Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior." ISCA '13.
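The slide only says that CMetric accounts for the duration and degree of parallelism of a time slice. A minimal sketch, assuming the criticality-stack style formulation of the cited Du Bois et al. work (each interval's duration is divided equally among the threads active in it), together with the sample-rejection rule from the bullets. The function names and input formats are hypothetical and are not taken from GAPP itself.

```python
from collections import defaultdict


def cmetric_per_thread(intervals):
    """Accumulate duration / active-thread-count for every active thread.

    `intervals` is an iterable of (duration, active_thread_ids) pairs.
    Assumed formulation: fewer active threads means a larger share each.
    """
    scores = defaultdict(float)
    for duration, active in intervals:
        if not active:
            continue  # nothing running, so nothing to attribute
        share = duration / len(active)
        for thread_id in active:
            scores[thread_id] += share
    return dict(scores)


def keep_critical_samples(samples, n_min):
    """Drop instruction-pointer samples taken while N_act > N_min.

    `samples` is an iterable of (instruction_pointer, n_active) pairs.
    """
    return [ip for ip, n_active in samples if n_active <= n_min]


# Hypothetical trace: all four threads run briefly, then thread 1 runs alone.
intervals = [(10, {1, 2, 3, 4}), (30, {1}), (20, {1, 2})]
scores = cmetric_per_thread(intervals)

samples = [("IP1", 1), ("IP4", 3), ("IP8", 2)]
kept = keep_critical_samples(samples, n_min=2)
```

Under this assumption, thread 1 accumulates most of the metric because it spends a long interval running alone, which is the situation the profile is meant to surface.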

  9. Ranking Bottlenecks
     • Similar call paths are merged, together with their samples and CMetric, and sorted to display the potential critical call paths, the functions and lines of code, and the CMetric of individual threads.
     Critical Path 1:
       deflate_slow()
       <---deflate()
       <---compress()
       <---Compress()
     Functions and lines + Frequency (optimization opportunities):
       deflate_slow -- 1465
       deflate.c:1650 (StackTop) -- 575
       deflate.c:1580 -- 354
     ThreadID   CMetric (shows load imbalance, if any):
       25778    256130902
       25779    417320962
       25783    5003332502
       25784    5003756997
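The merge-and-sort step could look roughly like the sketch below. It treats two call paths as "similar" only when they are identical, which is a simplifying assumption, as are the function name `rank_call_paths`, the input tuple layout, and all numeric values.

```python
from collections import Counter, defaultdict


def rank_call_paths(slices):
    """Merge critical time slices by call path and rank by total CMetric.

    `slices` is an iterable of (call_path, cmetric, sampled_locations),
    where call_path is a tuple of function names (innermost first) and
    sampled_locations is a list of function or file:line strings.
    """
    totals = defaultdict(float)
    frequencies = defaultdict(Counter)
    for call_path, cmetric, locations in slices:
        totals[call_path] += cmetric
        frequencies[call_path].update(locations)
    ranked_paths = sorted(totals, key=totals.get, reverse=True)
    return [(p, totals[p], frequencies[p].most_common()) for p in ranked_paths]


# Hypothetical input; the values are made up for illustration.
path_a = ("deflate_slow", "deflate", "compress")
path_b = ("read_input", "main")
ranked = rank_call_paths([
    (path_a, 300.0, ["deflate.c:1650", "deflate.c:1580"]),
    (path_b, 400.0, ["input.c:10"]),
    (path_a, 200.0, ["deflate.c:1650"]),
])
```

Here `path_a` ranks first because its two slices together outweigh the single `path_b` slice, and its most frequently sampled line is listed first.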

  10. GAPP - Evaluation
     • Evaluated using applications from the Parsec-3.0 benchmark suite and two large open source projects, MySQL and Nektar++.
     • All applications except Nektar++ were multithreaded.
     • Each was executed with 64 threads.
     • Nektar++, a spectral/hp element framework which uses message passing, was executed with 16 MPI processes.

  11. Load imbalance from thread CMetric
     Multithreaded Task Parallel Application - Ferret
     • Six pipeline stages; the first and last stages perform I/O with single threads.
     Fig: Ferret pipeline stages with initial thread allocation
       Stage:    Load | Segmentation | Feature extraction | Indexing | Ranking | Out
       Threads:  1    | 15           | 15                 | 15       | 15      | 1
     Fig: GAPP Profile for Ferret
     Critical Path 1:
       emd()
       <---sdist_emd()
       <---raw_query()
       <---cass_table_query()
       <---t_rank()
       <---start_thread()
     Functions and lines + Frequency:
       isOptimal -- 41314
         emd.c:422 -- 20813
         emd.c:423 -- 10760
         emd.c:420 -- 6657
       findBasicVariables -- 41301
         emd.c:350 -- 7366
         emd.c:353 -- 6713
         emd.c:383 -- 5827

  12. Optimizing Ferret by thread reallocation
     • The ranking phase exhibited a higher CMetric than the other stages.
     • Optimized by re-allocating threads to the ranking phase.
     Fig: CMetric for different thread allocations - Ferret
     [Figure: CMetric values against thread index for the thread allocations 15-15-15-15 (initial), 15-5-15-25, and 2-1-18-39; annotated run times are 30s for the initial allocation, 20s, and 15s after optimization]

  13. Resource Contention – MySQL
     Sysbench OLTP_read_write workload
     Critical Path 1 (Disk I/O):
       fil_flush()[mysqld]
       <---log_write_up_to()
       <---trx_commit_complete_for_mysql()
       <---innobase_commit()
       <---ha_commit_low()
       <---TC_LOG_DUMMY::commit()
       <---ha_commit_trans()
       <---trans_commit()
       <---mysql_execute_command()
       <---Prepared_statement::execute()
     Functions and lines + frequency:
       pfs_os_file_flush_func -- 1462
       os0file.ic:507 (StackTop) -- 1462
     Critical Path 2 (Spin-wait Loop):
       sync_array_reserve_cell()
       <---rw_lock_s_lock_spin()
       <---pfs_rw_lock_s_lock_func()
       <---row_search_mvcc()
       <---ha_innobase::index_read()
       <---handler::ha_index_read_idx_map()
       <---join_read_const_table()
       <---JOIN::extract_func_dependent_tables()
       <---JOIN::make_join_plan()
       <---JOIN::optimize()
     Functions and lines + frequency:
       sync_array_reserve_cell() -- 469
       sync0arr.cc:389 (StackTop) -- 469

  14. Optimizing MySQL
     Critical Function 1 (Hardware Resource Contention)
     • pfs_os_file_flush_func()
       – Invoked by InnoDB; flushes write buffers to disk.
       – Increasing the buffer size improved the transaction rate by 19% and reduced latency by 16%.
     Critical Function 2 (Software Resource Contention)
     • sync_array_reserve_cell()
       – Invoked from a custom-built spin lock that blocks after spinning for a predefined time.
       – Increasing the spin wait time reduced cache misses by 10.6%.
     These 2 modifications cumulatively improved the query transaction rate by 34% and reduced average latency by 25%.

  15. Bodytrack – Parsec3.0
     Multithreaded application that follows the producer-consumer paradigm.
     Critical Call Path 1:
       void FlexDownSample2()
       <---TrackingModel::OutputBMP()
       <---mainPthreads()
       <---main()
     • Delegating OutputBMP to a writer thread improved performance by 22%.
     [Figure: main producer loop with the steps Update Images (read next set of images from the AsyncIO thread queue), Estimate (send command to the pool of worker threads), WritePose, and OutputBMP (delegate to writer thread)]

  16. GAPP on MPI Applications
     • Nektar++ - a spectral/hp element framework that implements several PDE solvers.
     • Evaluated using the Incompressible Navier-Stokes Solver with 16 MPI processes.
     • Load imbalance was found to be due to non-uniform partitioning of the mesh.
     Fig: CMetric of Individual Processes
     [Figure: normalised CMetric (axis range 0 to 12) against task ID (1 to 16)]
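The slide does not state how the per-process CMetric was normalised. The sketch below assumes division by the mean across tasks, so a perfectly balanced run would give 1.0 for every task; the 1.5 threshold, the function names, and the sample values are likewise hypothetical.

```python
from statistics import fmean


def normalise_cmetric(cmetric_by_task):
    """Divide each task's CMetric by the mean over all tasks (assumed scheme)."""
    mean_value = fmean(cmetric_by_task.values())
    return {task: value / mean_value for task, value in cmetric_by_task.items()}


def imbalanced_tasks(normalised, threshold=1.5):
    """Return task IDs whose normalised CMetric exceeds the threshold."""
    return sorted(task for task, value in normalised.items() if value > threshold)


# Hypothetical per-task CMetric totals (not measurements from the talk).
raw = {1: 100.0, 2: 100.0, 3: 100.0, 4: 500.0}
normalised = normalise_cmetric(raw)
suspects = imbalanced_tasks(normalised)
```

A task that stands well above 1.0 spent far more time running with few peers active, which is the pattern the slide attributes to uneven mesh partitioning.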

  17. GAPP Profile - Nektar++
     Critical Path 1:
       __GI___poll()[libc-2.27.so]
       <---MPIDI_CH3I_Progress()[libmpi.so.12.1.1]
       <---MPIC_Wait()[libmpi.so.12.1.1]
       <---MPIC_Recv()[libmpi.so.12.1.1]
       <---MPIR_Bcast_binomial()[libmpi.so.12.1.1]
       <---MPIR_Bcast_intra()[libmpi.so.12.1.1]
       <---MPIR_Bcast()[libmpi.so.12.1.1]
       <---MPIR_Bcast_impl()[libmpi.so.12.1.1]
       <---MPIR_Allreduce_intra()[libmpi.so.12.1.1]
       <---MPIR_Allreduce_impl()[libmpi.so.12.1.1]
     For each critical path: Functions and lines + Frequency
       dgemv_() [libblas.so.3.8.0] -- 594
       double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 116
       gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 58
     Combining functions and lines from critical paths: Top Critical Functions and lines + Frequency
       dgemv_() [libblas.so.3.8.0] -- 781
       double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 170
       gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 100

  18. Optimizing critical functions – Nektar++
     • Bottleneck function: the matrix multiplication routine (dgemv) exported by the BLAS library.
     • Replacing the default BLAS libraries with OpenBLAS improved run time by 27%.
     [Figure: sample count per function name before and after optimization; dgemv is marked as the bottleneck function in the "before" panel]

  19. Conclusion
     • GAPP was able to identify different types of serialization bottlenecks in different classes of applications.
     • Robust – consistent results across multiple runs under the same test conditions.
     • Customizable – tuneable parameters: N_min, sampling frequency, stack depth, and an option to include results from dynamic libraries.
     • Limitation – will not work with spin-wait loops that do not block.
     • Available at https://github.com/RN-dev-repo/GAPP

  20. Thank You
