GAPP: A Fast Profiler for Detecting Serialization Bottlenecks in Parallel Linux Applications – PowerPoint PPT Presentation
Reena Nair, Tony Field
What causes serialization bottlenecks?
- Resource Contention
  – Hardware: CPU, peripherals
  – Software: locks
- Load Imbalance

Fig: Execution time per thread (execution time vs. thread ID).
Serialization Bottlenecks – Reduced Parallelism

Fig: Four threads on four cores over time – with maximum parallelism all threads run concurrently; with reduced parallelism, threads blocked at a barrier leave cores idle.
Profilers for debugging performance issues
- There are many different sources of bottlenecks.

Fig: Different profilers (A–D) each target a different bottleneck source: locks, memory, peripherals, critical threads.
GAPP – Generic Automatic Parallel Profiler
- Can identify several different types of serialization bottlenecks.
- No need to instrument the application.
- Validated on multithreaded and multi-process parallel applications written in C/C++.
- Implemented using the extended Berkeley Packet Filter (eBPF).
  – Provides fast and secure kernel tracing (~4% average runtime overhead).
Harness the symptom rather than the cause
- Identify when and where reduced parallelism is exhibited.
  – Number of active threads Nact <= Nmin, a tuneable threshold with a default value of N/2, where N is the total number of threads.
- Trace context-switch events in the kernel.
  – Retrieve the stack trace at the end of each time slice.
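The trigger logic above can be sketched in plain Python. This is a minimal illustration of the decision GAPP makes at each context switch, not its actual eBPF kernel code; the function names (`make_tracker`, `on_block`, `on_wake`, `slice_is_critical`) and the dictionary layout are illustrative assumptions.

```python
# Illustrative sketch of GAPP's trigger: maintain the active-thread
# count as threads block and wake, and at each context switch decide
# whether the ending time slice was "critical" (Nact <= Nmin), in
# which case a stack trace would be retrieved.

def make_tracker(total_threads, n_min=None):
    """Track active threads; Nmin defaults to N/2, as in GAPP."""
    return {"active": total_threads,
            "n_min": n_min if n_min is not None else total_threads // 2}

def on_block(state):
    state["active"] -= 1   # a thread blocks (e.g. on a lock or barrier)

def on_wake(state):
    state["active"] += 1   # a thread becomes runnable again

def slice_is_critical(state):
    # In GAPP this check runs in the kernel, via eBPF, at context-switch
    # time; stack traces are retrieved only for critical slices.
    return state["active"] <= state["n_min"]
```

The key design point the slide makes is that this check is cheap: only the active-thread counter is maintained on every event, and the expensive stack-trace retrieval is paid only when parallelism has actually dropped below the threshold.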
Stack Traces (ST)
- Reduce overhead – retrieve stack traces only from critical time slices.
- Critical time slice – one whose average active thread count is <= Nmin.
- Omit ST2 (taken in a non-critical time slice).

Fig: Four threads over four cores, blocked at a barrier – ST3 is retrieved from the critical time slice; ST2, from a non-critical slice, is omitted.
Are stack traces enough to identify the bottleneck?
- Stack traces retrieved at the end of a time slice point to bottleneck code only if that code happened to execute at the end of the time slice – otherwise the bottleneck is missed.

Fig: End-of-slice stack traces (ST1–ST4) across four cores, with the active-thread count per slice; bottleneck code executing mid-slice is missed.
Combining bottleneck code and call paths
- Periodically sample instruction pointers.
- Reject samples if Nact > Nmin.
- Combine the instruction pointers and stack traces of each critical time slice.
- Each critical time slice is assigned a metric, the Criticality Metric¹ (CMetric), which takes into account the duration and degree of parallelism of the time slice.

¹ Du Bois, Kristof, et al. "Criticality stacks: Identifying critical threads in parallel programs using synchronization behavior.", ISCA '13.

Fig: Periodic IP samples (IP1–IP13) across four cores, combined with the stack traces of critical time slices; samples taken while Nact > Nmin are rejected.
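One plausible form of the metric described above, following the criticality-stacks idea of Du Bois et al. that the slide cites: each time slice contributes its duration divided by the number of threads active during it, so long slices with little parallelism score highest. GAPP's exact weighting may differ; this sketch and its data layout are assumptions for illustration.

```python
# Illustrative CMetric accumulation: a slice's contribution grows with
# its duration and shrinks with its degree of parallelism, so a thread
# that runs long stretches nearly alone accumulates a high score.
from collections import defaultdict

def cmetric(slices):
    """slices: iterable of (thread_id, duration, active_threads)."""
    scores = defaultdict(float)
    for tid, duration, n_active in slices:
        scores[tid] += duration / n_active
    return dict(scores)
```

For example, a thread that runs 6 time units alone outscores one that runs 10 units alongside another thread, which matches the intuition that serialized execution is what makes a thread critical.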
Ranking Bottlenecks
- Similar call paths, their samples and CMetric values are combined and sorted to display potential critical call paths, functions and lines of code, and the CMetric of individual threads.

Sample profile:

  ThreadID    CMetric
  25778       256130902
  25779       417320962
  25783       5003332502     <- load imbalance, if any
  25784       5003756997

  Critical Path 1: deflate_slow() <---deflate() <---compress() <---Compress()
  Functions and lines + Frequency:
    deflate_slow -- 1465
      deflate.c:1650 (StackTop) -- 575     <- optimization opportunities
      deflate.c:1580 -- 354
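The ranking step described above can be sketched as a simple merge-and-sort: samples sharing a call path are merged, their CMetric contributions summed, and paths reported in descending order of criticality. The data layout (`(call_path, contribution)` tuples) is an illustrative assumption, not GAPP's internal representation.

```python
# Illustrative ranking: aggregate CMetric contributions per call path,
# then sort so the most critical path is listed first.
from collections import defaultdict

def rank_bottlenecks(samples):
    """samples: iterable of (call_path_tuple, cmetric_contribution)."""
    totals = defaultdict(float)
    for path, contribution in samples:
        totals[path] += contribution
    # Highest aggregate CMetric first -> most likely bottleneck.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Representing a call path as a tuple of frame names makes "similar call paths" hashable and directly mergeable; a real profiler would additionally fold per-line sample counts under each path, as in the profile shown above.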
GAPP – Evaluation
- Evaluated using applications from the PARSEC 3.0 benchmark suite and two large open-source projects, MySQL and Nektar++.
- All applications except Nektar++ were multithreaded; each was executed with 64 threads.
- Nektar++, a spectral/hp element framework that uses message passing, was executed with 16 MPI processes.
Load imbalance from thread CMetric
Multithreaded task-parallel application – Ferret
- Six pipeline stages – the first and last stages perform I/O with single threads.

Fig: Ferret pipeline stages with initial thread allocation:
  Load (1) -> Segmentation (15) -> Feature extraction (15) -> Indexing (15) -> Ranking (15) -> Out (1)

Fig: GAPP profile for Ferret:
  Critical Path 1: emd() <---sdist_emd() <---raw_query() <---cass_table_query() <---t_rank() <---start_thread()
  Functions and lines + Frequency:
    isOptimal -- 41314
      emd.c:422 -- 20813
      emd.c:423 -- 10760
      emd.c:420 -- 6657
    findBasicVariables -- 41301
      emd.c:350 -- 7366
      emd.c:353 -- 6713
      emd.c:383 -- 5827
Optimizing Ferret by thread reallocation
- The ranking phase exhibited a higher CMetric than the other stages.
- Optimized by re-allocating threads to the ranking phase.

Fig: CMetric values per thread index for different thread allocations – Ferret:
  15-15-15-15 (initial)             run time: 30 s
  15-5-15-25                        run time: 20 s
  2-1-18-39 (after optimization)    run time: 15 s
Resource Contention – MySQL
Sysbench OLTP_read_write workload

Critical Path 1 (disk I/O):
  fil_flush()[mysqld] <---log_write_up_to() <---trx_commit_complete_for_mysql() <---innobase_commit() <---ha_commit_low() <---TC_LOG_DUMMY::commit() <---ha_commit_trans() <---trans_commit() <---mysql_execute_command() <---Prepared_statement::execute()
  Functions and lines + frequency:
    pfs_os_file_flush_func -- 1462
      os0file.ic:507 (StackTop) -- 1462

Critical Path 2 (spin-wait loop):
  sync_array_reserve_cell() <---rw_lock_s_lock_spin() <---pfs_rw_lock_s_lock_func() <---row_search_mvcc() <---ha_innobase::index_read() <---handler::ha_index_read_idx_map() <---join_read_const_table() <---JOIN::extract_func_dependent_tables() <---JOIN::make_join_plan() <---JOIN::optimize()
  Functions and lines + frequency:
    sync_array_reserve_cell() -- 469
      sync0arr.cc:389 (StackTop) -- 469
Optimizing MySQL
- pfs_os_file_flush_func() – Critical Function 1 (hardware resource contention)
  – Invoked by InnoDB; flushes write buffers to disk.
  – Increasing the buffer size improved the transaction rate by 19% and reduced latency by 16%.
- sync_array_reserve_cell() – Critical Function 2 (software resource contention)
  – Invoked from a custom-built spin lock that blocks after spinning for a predefined time.
  – Increasing the spin-wait time reduced cache misses by 10.6%.
- These two modifications cumulatively improved the query transaction rate by 34% and reduced average latency by 25%.
Bodytrack – PARSEC 3.0
- Multithreaded application that follows the producer-consumer paradigm.
- Improved performance by 22%.

Fig: Bodytrack structure – an AsyncIO thread reads the next set of images from a queue; the main producer loop sends commands to a pool of worker threads, delegating Update, Estimate, WritePose and OutputBMP work; a writer thread handles output.

Critical Call Path 1 (main producer loop):
  void FlexDownSample2() <---TrackingModel::OutputBMP() <---mainPthreads() <---main()
GAPP on MPI Applications
- Nektar++ – a spectral/hp element framework that implements several PDE solvers.
- Evaluated using the incompressible Navier-Stokes solver with 16 MPI processes.
- The load imbalance was found to be due to non-uniform partitioning of the mesh.

Fig: Normalised CMetric of individual processes (Task IDs 1–16).
GAPP Profile – Nektar++

Critical Path 1:
  __GI___poll()[libc-2.27.so] <---MPIDI_CH3I_Progress()[libmpi.so.12.1.1] <---MPIC_Wait()[libmpi.so.12.1.1] <---MPIC_Recv()[libmpi.so.12.1.1] <---MPIR_Bcast_binomial()[libmpi.so.12.1.1] <---MPIR_Bcast_intra()[libmpi.so.12.1.1] <---MPIR_Bcast()[libmpi.so.12.1.1] <---MPIR_Bcast_impl()[libmpi.so.12.1.1] <---MPIR_Allreduce_intra()[libmpi.so.12.1.1] <---MPIR_Allreduce_impl()[libmpi.so.12.1.1]

Functions and lines + Frequency (for each critical path):
  dgemv_() [libblas.so.3.8.0] -- 594
  double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 116
  gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 58

Top critical functions and lines + Frequency (combining functions and lines from all critical paths):
  dgemv_() [libblas.so.3.8.0] -- 781
  double Vmath::Dot2<double>() [libLibUtilities-g.so.5.0.0b] -- 170
  gather_double_add() [libMultiRegions-g.so.5.0.0b] -- 100
Optimizing critical functions – Nektar++
- Bottleneck function (dgemv) – a matrix-vector multiplication routine exported by the BLAS library.
- Replacing the default BLAS libraries with OpenBLAS improved the run time by 27%.

Fig: Sample counts of the top-ranked functions (F1–F4) before and after optimization – the bottleneck function drops down the ranking after switching to OpenBLAS.
Conclusion
- GAPP was able to identify different types of serialization bottlenecks in different classes of applications.
- Robust
  – Consistent results across multiple runs under the same test conditions.
- Customizable
  – Tuneable parameters: Nmin, sampling frequency, stack depth, option to include results from dynamic libraries.
- Limitation
  – Will not work with spin-wait loops that do not block.
- Available at