SLIDE 1

Technische Universität München

Performance analysis with Periscope

  • M. Gerndt, V. Petkov,
  • Y. Oleynik, S. Benedict

Technische Universität München September 2010

SLIDE 2

Outline

  • Motivation
  • Periscope architecture
  • Periscope performance analysis model
  • Performance analysis strategies in Periscope
  • Periscope GUI

SLIDE 3

Motivation

Common performance analysis procedure on Power6 systems

  • Use Tprof to pinpoint time-consuming subroutines
  • Use Xprofiler to understand the call graph; mpitrace for MPI communication
  • Use hpmcount (libhpm) to measure hardware counters

Problem

  • Routine, error-prone and time-consuming
  • Requires deep hardware knowledge
  • Mostly a post-development process
  • Hard to map bottlenecks to their source code location

Solution

  • Automate the performance analysis
  • Integrate parallel application development and performance analysis within the same IDE

SLIDE 4

Periscope

Iterative online analysis

  • Measurements are configured, obtained and evaluated on the fly
  • No need to store trace files

Distributed architecture

  • Reduced network overhead
  • Analysis performed by multiple distributed hierarchical agents

Automatic bottleneck search

  • Based on performance optimization experts' knowledge
  • Single-node performance on Intel Itanium2, IBM Power6 and x86-based systems
  • MPI communication
  • OpenMP performance

Enhanced Eclipse-based GUI

Instrumentation: Fortran, C/C++; MPI / OpenMP / Hybrid

SLIDE 5

Distributed architecture

[Figure: architecture diagram. Component labels: Graphical User Interface (Eclipse-based GUI); Interactive frontend; Analysis control; Agents network; Monitoring Request Interface; Application.]

SLIDE 6

[Figure: analysis cycle involving the Instrumented Application, the Analysis Agents and the GUI. Labels: Start; Candidate Properties; Monitoring Requests; Performance Measurements; Raw performance data; Analysis; Proven Properties; Refinement; Precision; Location; Final Properties Report.]
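
One possible reading of the labels on this slide is a loop: candidate properties lead to monitoring requests, the raw data is analysed, proven properties are refined into new candidates, and the proven set ends up in the final report. The Python sketch below illustrates that reading only. The `Candidate` tuple, the `measure` and `refine` callbacks, the threshold and the toy data are all assumptions made for illustration, not Periscope's interfaces.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a candidate is a (property name, code region) pair.
Candidate = Tuple[str, str]


def search(start: List[Candidate],
           measure: Callable[[List[Candidate]], Dict[Candidate, float]],
           refine: Callable[[Candidate], List[Candidate]],
           threshold: float = 0.1,
           max_steps: int = 10) -> List[Tuple[Candidate, float]]:
    """Iterate: request measurements, evaluate candidates, refine proven ones."""
    candidates = list(start)
    seen = set(candidates)
    proven: List[Tuple[Candidate, float]] = []
    for _ in range(max_steps):
        if not candidates:
            break
        raw = measure(candidates)            # monitoring requests -> raw data
        next_candidates: List[Candidate] = []
        for cand in candidates:
            severity = raw.get(cand, 0.0)
            if severity > threshold:         # candidate becomes a proven property
                proven.append((cand, severity))
                for child in refine(cand):   # refinement yields new candidates
                    if child not in seen:
                        seen.add(child)
                        next_candidates.append(child)
        candidates = next_candidates
    # Final report: proven properties, most severe first.
    return sorted(proven, key=lambda item: item[1], reverse=True)


# Toy data standing in for an instrumented application (assumed values).
SEVERITY = {("cache misses", "main"): 0.4,
            ("cache misses", "main/loop1"): 0.35,
            ("cache misses", "main/loop2"): 0.02}
CHILDREN = {"main": ["main/loop1", "main/loop2"]}


def measure(cands: List[Candidate]) -> Dict[Candidate, float]:
    return {c: SEVERITY.get(c, 0.0) for c in cands}


def refine(cand: Candidate) -> List[Candidate]:
    name, region = cand
    return [(name, sub) for sub in CHILDREN.get(region, [])]


report = search([("cache misses", "main")], measure, refine)
```

In this toy run the search starts at the whole program, refines into the two loops, and reports only the regions whose severity exceeds the threshold.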

SLIDE 7

Automatic search for bottlenecks

Automation based on formalized expert knowledge

  • Efficient search algorithms (strategies)

Performance property

  • Condition
  • Confidence
  • Severity

Performance analysis strategies

  • Itanium2 Stall Cycle Analysis
  • IBM POWER6 Single Core Performance Analysis
  • MPI Communication Pattern Analysis
  • Generic Memory Strategy
  • OpenMP-based Performance Analysis
  • Scalability Analysis: OpenMP codes
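
The three parts of a performance property named on this slide (condition, confidence, severity) can be pictured as three methods on one object. The sketch below is a minimal illustration in Python. The class name, its fields and the 5% threshold are assumptions, not Periscope's actual property definitions.

```python
from dataclasses import dataclass


@dataclass
class CacheMissProperty:
    """Hypothetical property: cycles lost to cache misses in one code region."""
    region: str
    stall_cycles: float       # cycles attributed to cache misses (assumed input)
    total_cycles: float       # cycles of the whole run or phase (assumed input)
    threshold: float = 0.05   # assumed cut-off for reporting

    def severity(self) -> float:
        # Fraction of all cycles lost; used to rank properties against each other.
        if self.total_cycles <= 0:
            return 0.0
        return self.stall_cycles / self.total_cycles

    def condition(self) -> bool:
        # The property holds only if the loss is large enough to matter.
        return self.severity() > self.threshold

    def confidence(self) -> float:
        # 1.0 here because the inputs are treated as exact counter values;
        # an estimated or sampled input would justify a lower value.
        return 1.0


prop = CacheMissProperty(region="loop1", stall_cycles=200.0, total_cycles=1000.0)
```

With these toy numbers `prop.condition()` holds and `prop.severity()` is 0.2, so the region would be reported with a severity of 20%.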

SLIDE 8

POWER6 Single Core Performance Properties

  • Hot spot of the application
  • Cycles lost due to cache misses
  • Average number of cycles lost per L1 miss
  • High L1 demand load miss rate
  • High L2 demand load miss rate
  • High L3 demand load miss rate
  • Cycles lost due to address translation misses
  • Cycles lost due to store instructions
  • Cycles lost due to floating-point instruction inefficiencies
  • Cycles lost due to integer multiplications and divisions
  • Cycles lost due to no instruction to dispatch
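
Two of the properties listed here reduce to simple ratios of hardware counter values. The sketch below shows how an L1 demand load miss rate and an average cost per L1 miss could be derived. The dictionary keys are hypothetical placeholders and do not correspond to real POWER6 event names.

```python
from typing import Dict, Tuple


def l1_metrics(counters: Dict[str, float]) -> Tuple[float, float]:
    """Return (L1 demand load miss rate, average cycles lost per L1 miss).

    The keys 'loads', 'l1_load_misses' and 'l1_miss_stall_cycles' are
    made-up names for this sketch.
    """
    loads = counters["loads"]
    misses = counters["l1_load_misses"]
    stall = counters["l1_miss_stall_cycles"]
    miss_rate = misses / loads if loads else 0.0
    cycles_per_miss = stall / misses if misses else 0.0
    return miss_rate, cycles_per_miss


rate, cost = l1_metrics({"loads": 1000, "l1_load_misses": 50,
                         "l1_miss_stall_cycles": 600})
```

For the toy counters above, 50 misses in 1000 loads give a 5% miss rate, and 600 stall cycles over 50 misses give 12 cycles per miss.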

SLIDE 9

Itanium2 Stall Cycle Properties

IA64 Pipeline Stall Cycles

  • Stalls due to pipeline flush
  • Stalls due to branch misprediction flush
  • Stalls due to exception flush
  • Stalls due to floating point exceptions or L1D TLB misses
  • Stalls due to Flush to zero or SIR stalls
  • Stalls due to L1D TLB misses
  • ...
  • Stalls due to waiting for data delivery to register
  • Stalls due to waiting for integer register
  • Stalls due to waiting for integer results
  • Stalls due to waiting for FP register
  • Stalls due to waiting for integer loads
  • L3 misses dominate data access
  • L2 misses
  • L3 misses
  • Stalls due to register stack engine

SLIDE 10

MPI Communication Patterns Analysis

[Figure: processes p1 and p2 with an MPI_Recv and an MPI_Send call.]

  • Automatic detection of wait patterns
  • Measurement on the fly
  • No tracing required!
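
A common example of a wait pattern between an MPI_Recv and an MPI_Send is the late sender: the receiver enters the receive before the matching send has started and idles until it does. The sketch below computes that waiting time from entry timestamps. It assumes both timestamps come from a common clock, and the function names are illustrative rather than part of Periscope.

```python
from typing import Iterable, Tuple


def late_sender_wait(recv_enter: float, send_enter: float) -> float:
    """Time the receiver idles because the matching send started later."""
    return max(0.0, send_enter - recv_enter)


def late_sender_severity(pairs: Iterable[Tuple[float, float]],
                         total_time: float) -> float:
    """Fraction of total_time lost to late senders.

    pairs holds (recv_enter, send_enter) timestamps of matched messages.
    """
    waited = sum(late_sender_wait(r, s) for r, s in pairs)
    return waited / total_time if total_time > 0 else 0.0


# Two matched messages: the first send is 2 s late, the second is on time.
severity = late_sender_severity([(1.0, 3.0), (5.0, 4.5)], total_time=10.0)
```

Only the first message contributes waiting time here, so the severity is 2 s out of 10 s.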

SLIDE 11

MPI Performance Properties

  • Excessive MPI time in receive due to late sender
  • Excessive MPI time due to late root in broadcast
  • Excessive MPI time in root due to late process in reduce
  • Excessive MPI time in ... (1xN, Nx1, 1x1, NxN)
  • Excessive MPI time due to many small messages
  • Excessive MPI communication time

SLIDE 12

OpenMP-based Performance Properties

  • Searches for OpenMP-based performance problems in a single step
  • Properties are divided into four major domains:
      • Startup and shutdown overhead
      • Load imbalance in OpenMP regions:
          • Parallel region
          • Parallel loop
          • Explicit barrier
          • Parallel sections
          • Not enough sections
          • Uneven sections
      • Sequential computation in parallel regions:
          • Master region
          • Single region
          • Ordered loop
      • OpenMP synchronization properties:
          • Critical section overhead property
          • Frequent atomic property
  • OpenMP tasking analysis under development
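
The load imbalance domain can be illustrated with per-thread execution times of one parallel region. The sketch below uses a common imbalance measure: the gap between the slowest thread and the mean, relative to the slowest thread. The formula and the function name are assumptions for illustration and not necessarily the severity definition Periscope uses.

```python
from typing import Sequence


def load_imbalance(thread_times: Sequence[float]) -> float:
    """Average fraction of the region's duration that threads spend waiting.

    thread_times holds the busy time of each thread inside one parallel region.
    The region lasts as long as its slowest thread.
    """
    if not thread_times:
        return 0.0
    longest = max(thread_times)
    if longest <= 0:
        return 0.0
    mean = sum(thread_times) / len(thread_times)
    return (longest - mean) / longest


# Four threads: two finish after 4 s, two after 2 s.
imbalance = load_imbalance([4.0, 2.0, 2.0, 4.0])
```

With a mean of 3 s against a slowest thread of 4 s, a quarter of the region's duration is idle time on average.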

SLIDE 13

Scalability Analysis: OpenMP codes

  • Identifies the OpenMP code regions that do not scale well
  • Scalability analysis is done by the frontend
  • No need to manually configure the runs and find the speedup!

[Flowchart, labels in order of appearance:]

  • Frontend initialization
  • Frontend.run() (loop label: n)
      i. Starts application
      ii. Starts analysis agents
      iii. Receives found properties
  • After 2n runs:
      • Extracts information from the found properties
      • Does scalability analysis
      • Exports the properties
  • GUI-based analysis

SLIDE 14

Scalability Analysis Properties

Meta-properties

  • Properties occurring in all configurations
  • Property with increasing severity across the configurations

Speedup-based properties

  • Linear speedup
  • Super-linear speedup
  • Linear speedup failed for the first time
  • Speedup decreasing

Experiment-specific properties

  • Code region with the lowest speedup
  • Low speedup based on a threshold
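
The speedup-based properties can be derived from the execution time of one code region measured at several thread counts. The sketch below classifies each configuration. The 10% tolerance around ideal speedup, the dictionary layout and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Optional


def speedups(times: Dict[int, float]) -> Dict[int, float]:
    """Speedup per thread count relative to the 1-thread run."""
    base = times[1]
    return {p: base / t for p, t in sorted(times.items())}


def classify(times: Dict[int, float], tolerance: float = 0.1) -> dict:
    """Label each thread count using the speedup-based categories."""
    linear: List[int] = []
    superlinear: List[int] = []
    decreasing: List[int] = []
    first_failure: Optional[int] = None
    previous: Optional[float] = None
    for p, s in speedups(times).items():
        if s > p * (1 + tolerance):
            superlinear.append(p)
        elif s >= p * (1 - tolerance):
            linear.append(p)
        elif first_failure is None:
            first_failure = p        # linear speedup failed for the first time
        if previous is not None and s < previous:
            decreasing.append(p)     # slower than the previous configuration
        previous = s
    return {"linear": linear, "superlinear": superlinear,
            "first_failure": first_failure, "decreasing": decreasing}


# Region time in seconds for 1, 2, 4 and 8 threads (toy values).
result = classify({1: 8.0, 2: 4.0, 4: 2.5, 8: 3.0})
```

For these toy timings the region scales linearly up to 2 threads, first misses linear speedup at 4 threads, and its speedup decreases at 8 threads.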

SLIDE 15

Graphical User Interface

Integrates with the Eclipse Development Platform

  • Open-source, extensible and very popular IDE
  • Supports different programming languages: C/C++, Fortran, etc.
  • Uses the Eclipse Parallel Tools Platform (PTP), which provides a higher-level abstraction of the underlying parallel system

Designed to combine:

  • Performance measurement functionality of Periscope
  • Advanced IDE functions like code indexing, refactoring, etc.

Features

  • Multi-functional table to display the detected bottlenecks
  • Outline of the instrumented code regions
  • Clustering techniques to get classes of similarly behaving processes
  • Supports both local and remote projects
  • Higher-level configuration and execution of performance experiments

SLIDE 16

Graphical User Interface

[Screenshot with labelled panels: Project view; Source code view; Properties view; SIR outline view.]

SLIDE 17

Periscope GUI: Properties Table

  • Simple and clean tree-based overview
  • Multi-level grouping
  • Complex data filtering
  • Multiple criteria sorting algorithm
  • Navigation from the properties to their source code location
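
The filtering, grouping and multi-criteria sorting offered by the properties table can be pictured with a few lines of Python. This is only an illustration of the idea. The row layout, the file names and the sort order are made up and do not describe the GUI's implementation.

```python
from itertools import groupby
from typing import Dict, List


def build_table(rows: List[dict], min_severity: float = 0.0) -> Dict[str, List[dict]]:
    """Filter by severity, sort by several criteria, then group by code region."""
    kept = [r for r in rows if r["severity"] >= min_severity]
    # Sort by region, then most severe first, then process rank.
    ordered = sorted(kept, key=lambda r: (r["region"], -r["severity"], r["process"]))
    return {region: list(group)
            for region, group in groupby(ordered, key=lambda r: r["region"])}


# Hypothetical detected properties.
rows = [
    {"property": "Late sender", "region": "solver.c:120", "process": 3, "severity": 0.30},
    {"property": "Load imbalance", "region": "solver.c:120", "process": 1, "severity": 0.45},
    {"property": "High L2 miss rate", "region": "init.c:42", "process": 0, "severity": 0.02},
    {"property": "Load imbalance", "region": "solver.c:120", "process": 0, "severity": 0.45},
]
table = build_table(rows, min_severity=0.05)
```

Each key of `table` corresponds to one group node of the tree, and the region string is what a navigation step would use to jump to the source location.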

SLIDE 18

Periscope GUI: Instrumentation Outline

  • Resembles the code outline view of the Eclipse C/C++ Development Tooling
  • Outlines the instrumented code regions and their nesting
  • Shows the number of properties in each region
  • Assists code navigation
  • Filters the displayed properties

SLIDE 19

Periscope GUI: RDT and EFS

Eclipse File System (EFS)

  • Abstracts the underlying file system details
  • Any supported file system can be used: remote projects using SSH/FTP/DStore, Local, Zip, etc.
  • Source files of the analyzed application reside only on the remote system; no need for synchronization

Remote Development Tools (RDT)

  • Part of the Eclipse Parallel Tools Platform (PTP) project
  • Remote compilation
  • Remote indexing
  • Currently supports only C/C++ applications

SLIDE 20

Periscope GUI: Experiment Configuration

External Tools Framework (ETFw)

  • Part of the Eclipse Parallel Tools Platform (PTP) project
  • More convenient environment using ETFw's Profile launch configuration
      • No terminal access needed
      • Higher-level configuration and automation possible

SLIDE 21

Clustering support

Properties summarization

Metaproperties

  • Identify hidden behavior
  • Weka workbench in the GUI:
      • Waikato Environment for Knowledge Analysis
      • Uses the K-Means algorithm
      • Groups properties based on CPU distribution and code region
  • Results shown in a table view similar to the properties view
  • Done also on the fly in the hierarchy

[Figure: Property 1, Property 2 and Property 3 grouped into Cluster 1, Cluster 2 and Cluster 3; CPU sets shown: 2-3,5,11,13-14; 7-10,16; 1,4,6,12,15.]
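
To make the K-Means step concrete, the sketch below clusters processes by their severity vectors, one value per property. It is a plain Python stand-in and does not use Weka. The deterministic start (the first k points as centroids), the feature layout and the toy data are assumptions chosen for readability.

```python
from typing import List, Optional, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain K-Means; returns one cluster label per point."""
    centroids = [list(p) for p in points[:k]]   # deterministic initialisation
    labels: Optional[List[int]] = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:                # assignments stable: converged
            break
        labels = new_labels
    return labels if labels is not None else []


# One row per process: severities of two hypothetical properties.
severities = [[0.90, 0.10], [0.10, 0.80], [0.85, 0.15],
              [0.12, 0.82], [0.88, 0.10], [0.10, 0.79]]
clusters = kmeans(severities, k=2)
```

Processes that receive the same label behave similarly and could be summarised as one row of CPU ranges, as in the figure on this slide.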

SLIDE 22

Thank you for your attention!

Current version 1.3 (New BSD License)

Available under: http://www.lrr.in.tum.de/periscope/Download

Supported architectures

  • SGI Altix 4700 Itanium2
  • IBM Power575 POWER6
  • x86-based architectures
  • BlueGene/P under development

Further information:

Periscope web page: http://www.lrr.in.tum.de/periscope